The equations of evolutionary change by natural selection are commonly expressed in statistical 
terms. Fisher's fundamental theorem emphasizes the variance in fitness. Quantitative genetics 
expresses selection with covariances and regressions. Population genetic equations depend on genetic 
variances. How can we read those statistical expressions with respect to the meaning of natural 
selection? One possibility is to relate the statistical expressions to the amount of information that 
populations accumulate by selection. However, the connection between selection and information 
theory has never been compelling. Here, I show the correct relations between statistical expressions 
for selection and information theory expressions for selection. Those relations link selection to 
the fundamental concepts of entropy and information in the theories of physics, statistics, and 
communication. We can now read the equations of selection in terms of their natural meaning. 
Selection causes populations to accumulate information about the environment. 



There are difficulties in applying information 
theory in genetics. They arise principally, not 
in the transmission of information, but in its 
meaning [1, p. 181]. 

INTRODUCTION 

I show that natural selection can be described by the 
same measure of information that provides the concep- 
tual foundations of physics, statistics and communica- 
tion. Briefly, the argument runs as follows. The classical 
models of selection express evolutionary rates in propor- 
tion to the variance in fitness. The variance in fitness is 
equivalent to a symmetric form of the Kullback-Leibler 
information that the population acquires about the en- 
vironment through the changes in gene frequency caused 
by selection. 

Kullback-Leibler information is closely related to 
Fisher information, likelihood, and Bayesian updating 
from statistics, as well as Shannon information and the 
measures of entropy that arise as the fundamental quan- 
tities of communication theory and physics. Thus, the 
common variances and covariances of evolutionary mod- 
els are equivalent to the fundamental measures of infor- 
mation that arise in many different fields of study. 

In Fisher's fundamental theorem of natural selection, 
the rate of increase in fitness caused by natural selec- 
tion is equal to the genetic variance in fitness. Equiv- 
alcntly, the rate of increase in fitness is proportional to 
the amount of information that the population acquires 
about the environment [2]. 

In my view, information is a primary quantity with in- 
tuitive meaning in the study of selection, whereas the ge- 
netic variance just happens to be an algebraic equivalence 
for the measure of information. The history of evolution- 
ary theory has it backwards, using statistical expressions 
of variances and covariances in place of the equivalent and 
more meaningful expressions of information. To read the 
fundamental equations of evolutionary change, one must 
learn to interpret the standard expressions of variances 
and covariances as expressions of information. 



OVERVIEW 

The first section reviews the classic statistical expres- 
sions for selection. Evolutionary change caused by se- 
lection is the covariance between fitness and character 
value. That covariance equals the regression of character 
value on fitness multiplied by the variance in fitness. 

The second section expresses selection in terms of the 
classic equations from information theory (see Box [2]). I 
show that the change in the mean logarithm of fitness 
is the Jeffreys information divergence. That divergence 
measures the accumulation of information by natural se- 
lection between the initial population and the population 
after it has been updated by selection. The relations be- 
tween the statistical and information perspectives follow 
by connecting the classic statistical expressions of selec- 
tion to the new information description for selection. 

The third section analyzes the Jeffreys divergence as 
the measure of information in the fundamental equations 
of selection. The Jeffreys divergence is the sum of two 
expressions for relative entropy. Relative entropy, known 
as the Kullback-Leibler divergence, measures the gain in 
information with regard to an abstract and universal no- 
tion of encoding, independently of the meaning of that in- 
formation. A universal, abstract measure of information 
in terms of encoding allows a general theory of informa- 
tion to provide the foundation for the deepest concepts 
in communication, physics and statistics. 

The fourth section concerns the meaning of information. 
Although encoding provides a useful measure with regard to 
information theory, we must also interpret the meaning of 
that information in terms of selection. Meaning arises by 
the relation of encoded information to whatever scale we 
use to interpret a particular problem. For selection, we 
interpret meaning with regard to characters. Characters 
may be gene frequencies or measurements made on 
individuals. Characters lead to a general notion of the 
scale for meaning with respect to the scale of encoded 
information. 



Box 1. Topics in the theory of natural selection 

This article is part of a series on natural selection. Although 
the theory of natural selection is simple, it remains endlessly 
contentious and difficult to apply. My goal is to make more 
accessible the concepts that are so important, yet either mostly 
unknown or widely misunderstood. I write in a nontechnical 
style, showing the key equations and results rather than 
providing full derivations or discussions of mathematical 
problems. Boxes list technical issues and brief summaries of the 
literature. 



The fifth section explicitly connects the abstract scale 
of encoded information to the meaningful scale of infor- 
mation in problems of selection. The analysis leads to the 
relation between the Jeffreys divergence, the most gen- 
eral expression for selection, and Fisher information as 
the limiting form of the Jeffreys divergence when changes 
in magnitude are small. Fisher information is the sensi- 
tivity of changes in abstract encoded information relative 
to the distance that one moves along a scale of meaning. 
Encoded information is equivalent to the log-likelihood 
ratio, which is why Fisher information provides the con- 
ceptual foundations for the theory of statistics. 

The sixth section uses Fisher information to derive var- 
ious elegant expressions for selection. For example, sup- 
pose that changes in the average value of a character suf- 
ficiently describe the changes caused by selection. Then 
mean log fitness increases by the Fisher information in an 
observation about the average character value multiplied 
by the squared change in the average character value. 
This expression connects the scale of encoded informa- 
tion, which is mean log fitness, to the scale of meaning, 
which in this case is the average value of a character in 
the population. 

The seventh section relates the parametric description 
of characters to a more general nonparametric expres- 
sion. In the previous example, the change caused by 
selection was described fully by a change in a parameter, 
the mean. In the general case, no parametric summary 
statistics fully capture the change in populations. In- 
stead, one must use the full range of different types in 
the population, providing a nonparametric description of 
the change in the distribution of frequencies caused by 
selection. The full nonparametric expression shows the 
universal applicability of the equations of selection and 
information. 



Box 2. Information, entropy and complexity 

Cover and Thomas [3] give an excellent introduction to infor- 
mation theory and its applications. Jaynes [4] is a fascinat- 
ing analysis of the connections between information, entropy, 
probability, Bayesian analysis, and statistical inference. Kull- 
back [5] is a broad synthesis of information theory in relation 
to classical statistics. Fisher's [6, 7] original papers on the 
theoretical foundations of statistics set the basis for all fu- 
ture work on information and statistics, with the 1925 paper 
showing the key role of Fisher information. 

Entropy arose in the study of thermodynamics [8–10]. 
Ben-Naim [11] gives a simple introduction. Hill [12] provides 
a classical text. Information theory arose in Fisher's work and 
separately in the study of communication through the analy- 
ses of Hartley [13] and Shannon [14, 15]. The underlying con- 
cepts of entropy and information are very close. Some think 
the concepts are identical, but controversy remains [4, 16]. 

Jeffreys' [17] divergence first appeared in an attempt to 
derive prior distributions for use in Bayesian analysis rather 
than as the sort of divergence used in this article. Kullback 
and Leibler [18] and Kullback [5] presented both the asym- 
metric divergence $\mathcal{D}$, given in Eq. (10), which is now known 
as the Kullback-Leibler divergence, and the symmetric form, 
$J$, given in Eq. (12), which is now known as the Jeffreys di- 
vergence. They noted Jeffreys' previous usage of $J$ in the 
context of Bayesian priors, and then developed the impor- 
tance of the divergence interpretation for statistical theory, 
particularly the asymmetric form, $\mathcal{D}$. 

I do not discuss Kolmogorov complexity in this article. 
However, it is an important concept that may ultimately 
prove as interesting for biological applications as the classic 
analyses of entropy and information. Kolmogorov complexity 
measures the information content of an object (individual) 
by the shortest binary computer program that fully describes 
the object [3, 19]. At the population level, the average Kol- 
mogorov complexity often has a close association with the 
formal theories of entropy and information, but it is not ex- 
actly the same. 

With respect to selection, fitness is, in essence, the match 
of characters to environmental challenge. That match de- 
pends on the algorithmic relation between the information 
content of an organism and the interpretation of that infor- 
mation through the development of phenotype. Development 
is not exactly like running a computer program encoded in the 
genes, but the analogy is not so far off. I suspect that, some- 
day, Kolmogorov complexity or related measures will help to 
understand biochemical, developmental and evolutionary pro- 
cesses. A few authors have taken the first steps [20–22]. 



The eighth section distinguishes changes by selection 
from total evolutionary change. Numerous extrinsic and 
unpredictable forces beyond selection can change the 
characteristics of populations and their fit to the environ- 
ment. I show the full expression for evolutionary change, 
placing selection in the broader evolutionary context. 
No general conclusion about total evolutionary change 
is possible, because the complete range of forces that can 
perturb populations remains unpredictable. However, we 
can express an elegant equilibrium condition. At equilib- 
rium, the gain in information by selection must be ex- 
actly balanced by the decay in information caused by 
other evolutionary forces. 

The Discussion reviews the main argument. Classic 
equations for selection describe change by statistical ex- 
pressions of covariances, variances, and regressions. In 
terms of encoded information, the change caused by se- 
lection is the Jeffreys divergence. A generalized notion 
of Fisher information connects encoded information to 
the scale of meaning. By equating the statistical descrip- 
tion with the information description, we learn how to 
read the fundamental equations of selection in terms of 
information. 



CLASSIC EQUATIONS OF NATURAL 
SELECTION 

Equations of natural selection are often expressed in 
the statistical language of population variances, covari- 
ances, and regressions. In this section, I show how these 
statistical expressions arise from the simplest models of 
selection. Later sections connect these classic equations 
to the amount of information that a population accumu- 
lates by selection. 

Textbooks on population genetics and quantitative ge- 
netics present the classic equations of selection [23–29]. 
Lande developed the statistical nature of selection equa- 
tions [30, 31]; see also Frank [32]. 

Selection 

A simple model starts with $n$ different types of indi- 
viduals. The frequency of each type is $q_i$. Each type has 
$w_i$ offspring, where $w$ expresses fitness. In the simplest 
case, each type is a clone producing $w_i$ copies of itself in 
each round of reproduction. 

The frequency of each type after selection is 

$$q_i' = q_i \left(\frac{w_i}{\bar{w}}\right), \tag{1}$$

where $\bar{w} = \sum q_i w_i$ is average fitness. The summation 
is over all of the $n$ different types indexed by the $i$ sub- 
scripts. See Box [3] for the proper interpretation of $q_i'$. 

This equation is called a haploid model in classical 
population genetics, because it expresses the dynamics 
of different alleles at a haploid genetic locus. Recently, 
economists, mathematicians, and game theorists have 
called this expression the replicator equation, because it 
expresses in the simplest way the dynamics of replication 
[33–35]. 

It is often convenient to rewrite Eq. (1) as the change 
in the frequency of each type, $\Delta q_i = q_i' - q_i$. Subtracting 
$q_i$ from both sides of Eq. (1) yields 

$$\Delta q_i = q_i \left(\frac{w_i}{\bar{w}} - 1\right). \tag{2}$$



Box [3] describes a universal interpretation of these equa- 
tions for selection that transcends the narrow haploid and 
replicator models. 
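
As a rough numerical illustration of Eqs. (1) and (2), the following minimal sketch applies one round of selection to three types. The frequencies and fitnesses are invented for illustration, and the helper names `mean` and `select` are assumptions of this sketch rather than anything defined in the article.

```python
def mean(q, x):
    """Population average of x, weighting each type by its frequency q_i."""
    return sum(qi * xi for qi, xi in zip(q, x))


def select(q, w):
    """Frequencies after selection, q'_i = q_i * w_i / w_bar, as in Eq. (1)."""
    w_bar = mean(q, w)  # average fitness
    return [qi * wi / w_bar for qi, wi in zip(q, w)]


# Illustrative values only (not from the article).
q = [0.5, 0.3, 0.2]  # initial frequencies, summing to 1
w = [1.0, 1.5, 2.0]  # fitness of each type

q_new = select(q, w)
delta_q = [qn - qi for qn, qi in zip(q_new, q)]  # frequency change, Eq. (2)

print("q' =", q_new)
print("delta q =", delta_q)
```

Because the updated frequencies still sum to one, the frequency changes sum to zero: types with above-average fitness gain exactly what the others lose.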



Characters 

Eq. (2) describes change in frequency. How does selec- 
tion change the value of characters? Suppose that each 
type, $i$, has an associated character value, $z_i$. The aver- 
age character value in the initial population is $\bar{z} = \sum q_i z_i$. 
The average character value in the descendant population 
is $\bar{z}' = \sum q_i' z_i'$, where $z_i'$ is the character value in the de- 
scendants (see Box [3]). For now, assume that descendants 
have the same character value as their parents, $z_i' = z_i$. 
Then $\bar{z}' = \sum q_i' z_i$, and the change in the average value of 
the character caused by selection is 

$$\bar{z}' - \bar{z} = \Delta_s \bar{z} = \sum q_i' z_i - \sum q_i z_i = \sum (q_i' - q_i) z_i,$$

where $\Delta_s$ means the change caused by selection [36–38]. 
We may simplify this expression by using $\Delta q_i = q_i' - q_i$ 
for frequency changes 

$$\Delta_s \bar{z} = \sum \Delta q_i z_i. \tag{3}$$

This equation expresses the fundamental concept of se- 
lection [39]. Frequencies change according to differences 
in fitness, as given by Eq. (2). Thus, Eq. (3) is the change 
in character value caused by differences in fitness, holding 
constant the character values, $z_i$. Later, we will also in- 
clude the changes in character values during transmission 
from parent to offspring, $\Delta z_i = z_i' - z_i$. 



Variance, covariance and regression 

Many of the classic equations of selection are expressed 
in terms of variances, covariances and regressions. I show 
the relation between the expression for frequency changes 
in Eq. (3) and the common statistical expressions for se- 
lection. 

Combining Eqs. (2) and (3) leads to 

$$\Delta_s \bar{z} = \sum \Delta q_i z_i = \sum q_i \left(\frac{w_i}{\bar{w}} - 1\right) z_i.$$

On the right-hand side, move the $\bar{w}$ term outside 

$$\Delta_s \bar{z} = \sum q_i \left(\frac{w_i}{\bar{w}} - 1\right) z_i = \sum q_i (w_i - \bar{w}) z_i / \bar{w}. \tag{4}$$

The definition of the population covariance allows us to 
rewrite this equation. Given a population of paired val- 
ues $(x_i, y_i)$, where each particular pair subscripted by $i$ 
occurs at frequency $q_i$, and writing $\bar{x}$ as the mean value in 
the population of the $x$ values, the population covariance 
has the general form 
Box 3. Interpretation of q' and z' 

Classical population genetics and replicator equation analy- 
ses interpret $q_i'$ in Eq. (1) as the frequency of type $i$ in the 
descendant population. However, selection theory in its most 
abstract and general form requires a set mapping interpre- 
tation, in which $q_i'$ is the frequency of descendants derived 
from type $i$ in the ancestral population. The set mapping 
interpretation arises from the Price equation [32, 40–42]. 
Similarly, $z_i'$, developed in Eq. (26) and mentioned earlier, 
is the average value of the property associated with $z$ among 
the descendants derived from ancestors with index $i$, rather 
than the usual interpretation of the character value of $i$ types 
in the descendant population. Here, I elaborate briefly on 
these interpretations of $q'$ and $z'$ by adapting the presentation 
in Frank [39]. 

Let $q_i$ be the frequency of the $i$th type in the ancestral 
population. The index $i$ may be used as a label for any sort 
of property of things in the set, such as allele, genotype, phe- 
notype, group of individuals, and so on. Let $q_i'$ be the frequen- 
cies in the descendant population, defined as the fraction of 
the descendant population that is derived from members of 
the ancestral population that have the label $i$. Thus, if $i = 2$ 
specifies a particular phenotype, then $q_2'$ is not the frequency 
of the phenotype $i = 2$ among the descendants. Rather, it 
is the fraction of the descendants derived from entities with 
the phenotype $i = 2$ in the ancestors. One can have partial 
assignments, such that a descendant entity derives from more 
than one ancestor, in which case each ancestor gets a frac- 
tional assignment of the descendant. The key is that the $i$ 
indexing is always with respect to the properties of the ances- 
tors, and descendant frequencies have to do with the fraction 
of descendants derived from particular ancestors. 

Given this particular mapping between sets, we can spec- 
ify a particular definition for fitness. Let $q_i' = q_i (w_i/\bar{w})$, where 
$w_i$ is the fitness of the $i$th type and $\bar{w} = \sum q_i w_i$ is average 
fitness. Here, $w_i/\bar{w}$ is proportional to the fraction of the de- 
scendant population that derives from type $i$ entities in the 
ancestors. 

Usually, we are interested in how some measurement 
changes or evolves between sets or over time. Let the mea- 
surement for each $i$ be $z_i$. The value $z$ may be the frequency 
of a gene, the squared deviation of some phenotypic value in 
relation to the mean, the value obtained by multiplying mea- 
surements of two different phenotypes of the same entity, and 
so on. In other words, $z_i$ can be a measurement of any prop- 
erty of an entity with label $i$. The average property value is 
$\bar{z} = \sum q_i z_i$, where this is a population average. 

The value $z_i'$ has a peculiar definition that parallels the 
definition for $q_i'$. In particular, $z_i'$ is the average measurement 
of the property associated with $z$ among the descendants de- 
rived from ancestors with index $i$. The population average 
among descendants is $\bar{z}' = \sum q_i' z_i'$. 

The Price equation (Eq. (26)) expresses the total change 
in the average property value, $\Delta \bar{z} = \bar{z}' - \bar{z}$, in terms of these 
special definitions of set relations. This way of expressing to- 
tal evolutionary change and the part of total change that can 
be separated out as selection is very different from the usual 
ways of thinking about populations and evolutionary change. 
The set mapping interpretation allows one to generalize equa- 
tions of selection theory and total evolutionary change to a 
much wider array of problems than would be possible under 
the common interpretations of the terms. By following the 
set mapping approach, our evaluation of selection and infor- 
mation can be presented in a much simpler and more general 
way. Note that the classic interpretations of the haploid and 
replicator models are special cases of the generalized set map- 
ping expressions. 



Box 4. Selection and information 

No one seems to have provided a full development of the re- 
lations between selection and information. In many respects, 
R. A. Fisher created the key concepts. However, before I start 
listing aspects of the problem and related citations, I cannot 
resist quoting from Li and Vitanyi [19, p. 96] about the dif- 
ficulties of attribution. In discussing the name "Kolmogorov 
complexity" for the discipline of the algorithmic analysis of 
complexity, they note that Solomonoff published the key idea 
before Kolmogorov, although Kolmogorov later discovered the 
idea independently and developed it more deeply and thor- 
oughly. Ultimately, Kolmogorov got almost all the credit, 
perhaps because he was much more famous than Solomonoff. 
Li & Vitanyi summarize as follows. 

Associating Kolmogorov's name with the area 
may be viewed as an example in the sociology of 
science of the Matthew effect, first noted in the 
Gospel according to Matthew, 25: 29-30, "For 
to every one who has more will be given, and he 
will have in abundance; but from him who has 
not, even what he has will be taken away." 

Fisher [43] discussed the relation of his fundamental the- 
orem of natural selection to the second law of thermody- 
namics, a universal law about changes in entropy. However, 
Fisher never came around to an information perspective in 
this discussion and, perhaps for that reason, was restrained 
in his enthusiasm for the analogy. Alternatively, Fisher's re- 
straint may have had to do with the high dimensionality of 
the evolutionary problem [43]. However, one of Fisher's great 
contributions in his book was his use of the average effect 
to reduce the dimensionality required for analyzing selection. 
Although Fisher never developed an information analysis of 
selection, one must remember that the modern field of infor- 
mation theory only began with Shannon's work on commu- 
nication [14, 15]. The use of Fisher information outside of 
statistical problems developed later. 

The analogy between selection and information is obvious 
and has been mentioned often. However, brief mention of the 
analogy does not, by itself, provide any real insight about the 
connections between information and selection or new ways 
in which to understand selection. 

Edwards [44] noted that, in the continuous-time limit, the 
fundamental equations of selection can be expressed in terms 
of Fisher information. However, he concluded that the anal- 
ogy between selection and Fisher information provides little 
insight. By contrast, Frieden et al. [45] argued that selec- 
tion expressed in terms of Fisher information is indeed sig- 
nificant. Although I believe Frieden et al. were on the right 
track, their particular analysis and presentation did not add 
much. Fisher information is always information about an un- 
derlying scale. Frieden et al. concluded that natural selection 
provides a measure of Fisher information about time, which I 
think is the wrong scale on which to interpret meaning. The 
present article extends the start made in Frank [2]. 
$$\sum q_i (x_i - \bar{x}) y_i = \mathrm{Cov}(x, y).$$

Note that the right-hand expression in Eq. (4) has the 
form of the covariance definition, so we can write 

$$\Delta_s \bar{z} = \sum q_i (w_i - \bar{w}) z_i / \bar{w} = \mathrm{Cov}(w, z)/\bar{w}, \tag{5}$$

following Price [46]. The standard definition of a regres- 
sion coefficient of $y$ on $x$ is the covariance of $y$ and $x$ 
divided by the variance of $x$. Thus, the regression of 
fitness, $w$, on character, $z$, is 

$$\beta_{wz} = \frac{\mathrm{Cov}(w, z)}{V_z}, \tag{6}$$

where $V_z$ denotes the variance of $z$. This expression im- 
plies $\mathrm{Cov}(w, z) = \beta_{wz} V_z$. We can also reverse the order 
of the regression, $\mathrm{Cov}(w, z) = \beta_{zw} V_w$. Thus, Eq. (5) is 
equivalently 

$$\Delta_s \bar{z} = \beta_{wz} V_z / \bar{w} = \beta_{zw} V_w / \bar{w}. \tag{7}$$



Because $z$ can be the value of any character, we can use 
fitness, $w$, in place of $z$, yielding 

$$\Delta_s \bar{w} = V_w / \bar{w}, \tag{8}$$

where the regression has disappeared because the regres- 
sion of a variable on itself is one, thus $\beta_{ww} = 1$. This 
expression shows that the change in mean fitness is the 
variance in fitness, normalized by the initial mean value. 

All of these expressions assume that character values 
do not change between parent and offspring, $\Delta z_i = 0$. As 
I mentioned, I will take up changes during transmission 
in a later section. 
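
The chain of identities in Eqs. (3), (5), (7) and (8) can be checked numerically. The sketch below is only an illustration under stated assumptions: the frequencies, fitnesses and character values are invented, character values are assumed not to change in transmission, and the helper names `mean` and `cov` are hypothetical.

```python
def mean(q, x):
    """Population average of x, weighting each type by its frequency q_i."""
    return sum(qi * xi for qi, xi in zip(q, x))


def cov(q, x, y):
    """Population covariance, sum of q_i (x_i - x_bar)(y_i - y_bar)."""
    x_bar, y_bar = mean(q, x), mean(q, y)
    return sum(qi * (xi - x_bar) * (yi - y_bar) for qi, xi, yi in zip(q, x, y))


# Illustrative values only (not from the article).
q = [0.5, 0.3, 0.2]  # initial frequencies, summing to 1
w = [1.0, 1.5, 2.0]  # fitnesses
z = [2.0, 3.0, 5.0]  # character values, unchanged between parent and offspring

w_bar = mean(q, w)
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]  # Eq. (1)

direct = mean(q_new, z) - mean(q, z)               # Eq. (3)
via_cov = cov(q, w, z) / w_bar                     # Eq. (5)

beta_wz = cov(q, w, z) / cov(q, z, z)              # Eq. (6)
beta_zw = cov(q, w, z) / cov(q, w, w)              # reversed regression
via_beta_wz = beta_wz * cov(q, z, z) / w_bar       # Eq. (7), first form
via_beta_zw = beta_zw * cov(q, w, w) / w_bar       # Eq. (7), second form

delta_w = mean(q_new, w) - w_bar                   # change in mean fitness
var_ratio = cov(q, w, w) / w_bar                   # Eq. (8)

print(direct, via_cov, via_beta_wz, via_beta_zw)
print(delta_w, var_ratio)
```

All four values on the first printed line should agree up to floating-point rounding, as should the two values on the second line.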



Box 5. Entropy, information and stochastic evolu- 
tionary models 

The most interesting development of the theory arises from 
stochastic models of evolutionary change framed in terms of 
entropy and statistical mechanics. Iwasa [47] derived a gen- 
eral expression for "free fitness" by analogy with free energy 
and entropy. Iwasa showed the analogy between the contin- 
ual increase of free fitness in evolutionary models and the 
second law of thermodynamics, by which entropy continually 
increases. He also calculated the distributions in population 
characteristics as they change under various stochastic models 
of evolutionary change. 

These kinds of stochastic evolutionary models require cer- 
tain assumptions in order to achieve continual increase in en- 
tropy or free fitness. There is certainly no universal law about 
the increase of fitness in evolution, whereas restricted notions 
of selection may have universal properties. I have drawn a 
sharp distinction between selection and evolution in my own 
analyses. The evolutionary literature does not always make 
that distinction so clearly. 

de Vladar and Barton [48] reviewed the significant ad- 
vances in the use of entropy and statistical mechanics to study 
evolutionary dynamics, including their own contributions to 
the subject [49, 50]. This work on stochastic evolutionary 
models may eventually converge with general studies of en- 
tropy, information and dynamics. For example, there has 
been recent discussion about a maximum entropy production 
(MEP) principle for dynamics [51–53]. In the MEP theory, 
the most likely dynamical path is associated with the greatest 
production of entropy. Further, the probability distribution 
over dynamical paths may be a function of the relative en- 
tropy production associated with the different paths. 

One may be able to use the distribution of entropy changes 
over paths to calculate the stochastic evolution of popula- 
tions. Under some conditions, one may be able to specify 
the expected probability distribution over types when the 
population achieves certain kinds of equilibrium. However, 
a full understanding of MEP and its limitations has yet to 
be achieved. There may be some relation between dynamics 
analyzed in terms of Fisher information [54] and MEP. How- 
ever, I do not understand the similarities and differences of 
those approaches. 



SELECTION EXPRESSED AS CHANGE IN 
INFORMATION 

This section derives a new result that connects the 
change in fitness caused by natural selection to the 
amount of information accumulated by the population. 
In particular, I express the change caused by selection 
in terms of a classical measure of information from for- 
mal information theory. Those readers unfamiliar with 
information theory will find some new expressions in this 
section, presented without explanation. The following 
sections explain the meaning of the expressions from in- 
formation theory and the connection to natural selection. 
(See Boxes [4]-[6] for prior work on selection and informa- 
tion.) 



Change in log fitness 

Fitness captures the notion of a match between a type 
and the environment. We may therefore expect that fit- 
ness is, in some way, an expression of the information in 
the population about the environment. Those types with 
high fitness increase in frequency, increasing the fitness 
(information) contained in the population. 

From Eq. (1), we can write the fitness of a type, $w_i$, in 
terms of current frequencies, $q_i$, and updated frequencies 
after selection, $q_i'$, as 

$$w_i = \bar{w}\,\frac{q_i'}{q_i}.$$

Fitness depends on the ratio of frequencies, $q_i'/q_i$. En- 
tities that depend on ratios have a natural logarithmic 
scaling [55]. Therefore, we should use the logarithmic 
scale when analyzing fitness [56]. It is traditional to de- 
scribe the logarithm of fitness as the Malthusian expres- 
sion, $m_i = \log(w_i)$, yielding 

$$m_i = \log(w_i) = \log(\bar{w}) + \log\left(\frac{q_i'}{q_i}\right).$$

Using $z = m$ as our character in the selection expression 
of Eq. (3), we have the increase in mean log fitness by 
natural selection as 

$$\Delta_s \bar{m} = \sum \Delta q_i m_i = \sum \Delta q_i \log\left(\frac{q_i'}{q_i}\right), \tag{9}$$

where the $\log(\bar{w})$ term drops out because $\sum \Delta q_i = 0$. 

An information measure for the change in fitness 

Perhaps the most important measure of information 
in communication, statistics and physics is the Kullback- 
Leibler divergence 

$$\mathcal{D}(q' \| q) = \sum q_i' \log\left(\frac{q_i'}{q_i}\right). \tag{10}$$

This divergence has directionality from the initial pop- 
ulation, $q$, to the updated population after selection, $q'$ 
(see Box [2]). Using this definition for $\mathcal{D}$ in the expression 
for the change in fitness given in Eq. (9), we obtain 

$$\Delta_s \bar{m} = \mathcal{D}(q' \| q) + \mathcal{D}(q \| q'). \tag{11}$$

This expression is the sum of Kullback-Leibler diver- 
gences taken in each direction between the initial pop- 
ulation, $q$, and the updated population after selection, 
$q'$. In information theory, this sum is known as the Jef- 
freys divergence 

$$J(q', q) = \mathcal{D}(q' \| q) + \mathcal{D}(q \| q'). \tag{12}$$

Thus, we have the simple expression for the change in 
mean log fitness caused by natural selection as 

$$\Delta_s \bar{m} = J, \tag{13}$$

where $J$ is shorthand for $J(q', q)$. Equating this expres- 
sion with Eq. (7), using $m = z$, we have 

$$J = \beta_{wm} V_m / \bar{w} = \beta_{mw} V_w / \bar{w}. \tag{14}$$

Thus, the variance in fitness is proportional to the in- 
formation divergence, J. The regression terms divided 
by w give the constants of proportionality that adjust 
for the different scales of measurement for fitness, w or 
m = log(w). This expression shows the relation between 
the information accumulated by natural selection, J, and 
the traditional statistical expressions of natural selection 
in terms of variances and regression coefficients. 
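
The equality between the change in mean log fitness, the Jeffreys divergence, and the regression form of Eq. (14) can be illustrated with a small numerical check. This is a sketch under assumptions, not the article's own computation: the values are invented, all frequencies and fitnesses are assumed strictly positive so that the logarithms exist, and the helper names `mean`, `cov` and `kl` are hypothetical.

```python
import math


def mean(q, x):
    """Population average of x, weighting each type by its frequency q_i."""
    return sum(qi * xi for qi, xi in zip(q, x))


def cov(q, x, y):
    """Population covariance, sum of q_i (x_i - x_bar)(y_i - y_bar)."""
    x_bar, y_bar = mean(q, x), mean(q, y)
    return sum(qi * (xi - x_bar) * (yi - y_bar) for qi, xi, yi in zip(q, x, y))


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) = sum p_i log(p_i / q_i), Eq. (10)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Illustrative values only (not from the article); all strictly positive.
q = [0.5, 0.3, 0.2]
w = [1.0, 1.5, 2.0]

w_bar = mean(q, w)
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]  # Eq. (1)
m = [math.log(wi) for wi in w]                     # log fitness, m_i = log(w_i)

delta_m = mean(q_new, m) - mean(q, m)              # change in mean log fitness
jeffreys = kl(q_new, q) + kl(q, q_new)             # Eq. (12)
via_regression = cov(q, m, w) / w_bar              # Eq. (14): beta_mw V_w / w_bar

print(delta_m, jeffreys, via_regression)
```

The three printed values should agree up to rounding. The regression form uses the fact that $\beta_{mw} V_w$ is simply the covariance of $m$ and $w$.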



Box 6. Bayesian interpretations of selection 



Bayesian updating combines prior information with new in- 
formation to improve prediction. The Bayesian process makes 
an obvious analogy with selection. The initial population 
encodes predictions about the fit of characters to the envi- 
ronment. Selection through differential fitness provides new 
information. The updated population combines the prior in- 
formation in the initial population with the new information 
from selection to improve the fit of the new population to 
the environment. I am sure this Bayesian analogy has been 
noted many times. But it has never developed into a coherent 
framework that has contributed significantly to understand- 
ing selection. 

Part of the problem is that the analogy, as currently de- 
veloped, provides little more than a match of labels between 
the theory of selection and Bayesian theory. As Harper [57] 
shows, if one begins with the replicator equation (Eq. (1)), 
then one can label the set $\{q_i\}$ as the initial (prior) popu- 
lation, $\{w_i/\bar{w}\}$ as the new information through differential 
fitness, and $\{q_i'\}$ as the updated (posterior) population. Shal- 
izi [58] presents a similar view. The analogy provides a useful 
correspondence between the structure of the theories but, by 
itself, does not provide any truly significant insight into se- 
lection. It may be possible to develop the analogy in useful 
ways, a challenge that remains open. 
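
The match of labels can be made concrete with a short sketch. It treats the initial frequencies as a prior and the fitnesses as an unnormalized likelihood, so that one Bayesian update reproduces Eq. (1). The numbers and the function name `bayes_update` are assumptions for illustration only, and the second fitness vector `w2` is a hypothetical second round of selection.

```python
def bayes_update(prior, likelihood):
    """Posterior_i proportional to prior_i * likelihood_i, normalized to sum to 1."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)  # plays the role of average fitness, w_bar
    return [j / total for j in joint]


# Illustrative values only (not from the article).
q = [0.5, 0.3, 0.2]    # initial (prior) population
w1 = [1.0, 1.5, 2.0]   # fitnesses in a first round of selection
w2 = [0.8, 1.2, 1.0]   # hypothetical fitnesses in a second round

# One round of selection written as a Bayesian update, matching Eq. (1).
q_new = bayes_update(q, w1)

# Two successive rounds equal a single update with the product of fitnesses,
# just as sequential Bayesian updates multiply likelihoods.
two_rounds = bayes_update(q_new, w2)
one_round = bayes_update(q, [a * b for a, b in zip(w1, w2)])

print(q_new)
print(two_rounds, one_round)
```

The sketch shows only the structural correspondence between the two theories. As the text notes, that correspondence alone does not supply new insight into selection.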

Another Bayesian line of study analyzes how individuals 
adjust their characters in response to information obtained 
directly from the environment. Those studies include learn- 
ing, phenotypic plasticity, and various aspects of conditional 
development. By one view, learning and other processes that 
accumulate information follow Popper's [59] dictum that all 
new knowledge must ultimately derive from trial and error, 
in effect, from selection. 

Vast literatures discuss information theoretic and Bayes- 
ian interpretations of learning, which are beyond our scope. 
In an explicitly selectionist view, Fernando et al. [60] ana- 
lyze theories of neural development in relation to Bayesian 
updating, part of the wider field of developmental selection 
[61–63]. Closer to the standard evolutionary interpretation of 
selection, Donaldson-Matasci et al. [64] provide an interest- 
ing discussion of information directly acquired from the en- 
vironment in relation to fitness. Frank [42, Section 6.3] used 
a Bayesian analysis to combine selectively acquired informa- 
tion by the population as a prior state with new information 
acquired directly from the environment (learning). 



THE ENCODING OF INFORMATION 

Before continuing to discuss the relation between se- 
lection and information, we need some additional back- 
ground about the nature of information. I first describe 
an example in which an observation provides informa- 
tion. I then discuss how to quantify the amount of infor- 
mation. Finally, I analyze the amount of information in 
a comparison, which provides the basis for comparing the 
information in a population before and after selection. 
Statistics and information

In statistical problems, the divergence, $\mathcal{D}$, measures the
amount of information in an observation with respect to discriminating
between two distributions [3, 5]. Suppose the true underlying probability
distribution is $q'$. However, we do not know whether we are sampling
from $q'$ or an alternative distribution $q$. The different distributions
may be associated with different values of a parameter, $\theta'$ and
$\theta$. The parameter may, for example, be the mean or the variance.

When we take a sample from the true underlying distribution, $q'$, how
much information do we obtain about whether the sampled distribution is
$q'$ or $q$? In the parametric case, how much information do we obtain
about whether the parameter of the distribution from which we sampled is
$\theta'$ or $\theta$?

For each observation, with value associated to the index $i$, the
relative likelihood of obtaining that observation from the true
distribution, $q'$, versus the alternative distribution, $q$, is the
ratio $q'_i/q_i$. The log of the likelihood ratio is $\log(q'_i/q_i)$.
Because the true distribution is $q'$, the actual probability of
observing $i$ is $q'_i$. Thus, averaging the log-likelihood ratio over
the probability of each observed $i$ value gives the average
log-likelihood ratio, which is

$$\mathcal{D}(q' \| q) = \sum_i q'_i \log\left(\frac{q'_i}{q_i}\right).$$



The divergence $\mathcal{D}$ is simply the average log-likelihood ratio,
which means an average of the relative weight of evidence in favor of
$q'$ as the true distribution compared with $q$. The greater the ratio of
likelihoods, the greater the divergence between distributions, and the
greater the information in each observed value to discriminate between
the distributions.
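As a small illustration of the average log-likelihood ratio, the following Python sketch computes the divergence for two discrete distributions. The function name and the example probabilities are assumptions made for illustration, and natural logarithms are used, so the result is in nats rather than bits.

```python
import math


def kl_divergence(q_new, q_old):
    """Average log-likelihood ratio: sum_i q'_i * log(q'_i / q_i), in nats.

    Terms with q'_i = 0 contribute nothing; q_i is assumed positive
    wherever q'_i is positive.
    """
    return sum(a * math.log(a / b) for a, b in zip(q_new, q_old) if a > 0)


q_old = [0.5, 0.5]  # hypothetical alternative distribution q
q_new = [0.9, 0.1]  # hypothetical true distribution q'

d = kl_divergence(q_new, q_old)             # positive: q' and q differ
d_same = kl_divergence(q_old, q_old)        # zero: nothing to discriminate
```

The more the two distributions differ, the larger the value, and so the more each observation helps to discriminate between them.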



The scale of information 

Clearly $\mathcal{D}$ gives a measure of information provided by an
observed value. But what sort of scale, or units, does that measure have?
If, for example, $\mathcal{D} = 2$, then what does the value "two" mean?

The Shannon measure of information is commonly 
used. That measure is related to entropy, which means 
randomness. The more random something is, the less in- 
formation we have about it. For example, if a flipped coin 
comes up on either side with equal probability, we say 
that it is completely random. We also say that we have 
no information about which side is likely to come up. The 
Shannon measure captures this duality between increas- 
ing randomness and decreasing information or, equiva- 
lently, between decreasing randomness and increasing in- 
formation. 

The Shannon measure is

$$H(q) = -\sum_i q_i \log(q_i). \tag{15}$$



We can use any base for the logarithm. It is sometimes 
convenient to use base 2, in which case H is the average 
number of bits required to encode a message. This bit- 
encoding interpretation arises from the fact that 

$$-\log_2(q_i) = \log_2(1/q_i)$$

expresses the number of bits required to encode a probability. For
example, if $q_i$ is 1/32, then $-\log_2(1/32) = \log_2(32) = 5$ bits.
A bit is a single binary digit, which takes on a value of either 0 or 1.
Five binary digits suffice to label 32 equally likely outcomes, from
00000 to 11111.

To encode a probability 1/32 requires 5 bits. By contrast, to encode a
probability of 1/2 requires only $\log_2(2) = 1$ bit. It takes 4 bits
more to encode 1/32 compared with 1/2. The key idea is that a rarer
event, with lower probability, $q$, provides greater surprise when the
event actually occurs. A greater surprise means a greater distinction
from what was expected, a lower ability to predict, more randomness and
less information. Thus, more bits means more randomness and less
information, providing a scale for measuring information in terms of
bits.
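The arithmetic in this passage can be checked directly. The helper name below is hypothetical; it simply evaluates $-\log_2$ of a probability.

```python
import math


def bits(prob):
    """Number of bits associated with an event of probability prob."""
    return -math.log2(prob)


rare = bits(1 / 32)      # bits for a probability of 1/32
common = bits(1 / 2)     # bits for a probability of 1/2
extra = rare - common    # additional bits for the rarer event
```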

The number of bits associated with each probability concerns only that
particular probability. How should we measure the randomness and
information over a set of different possible outcomes? For a
distribution, $q$, with different probabilities $q_i$ for each outcome,
$i$, we must combine the randomness (bits) associated with each
probability, $-\log_2(q_i)$, and the chance that the event $i$ occurs,
$q_i$.

In particular, the randomness associated with each event is the product
of how often the event happens multiplied by the randomness of that
event, $-q_i \log_2(q_i)$. The total over all events is the sum given in
the definition for $H(q)$ in Eq. (15), which measures the total
randomness over a set of events.

To understand the notion of total randomness over a set, we can think of
each $i$ as a symbol to be communicated or an event that may occur. A
message, or a set of events, has frequencies $q_i$. In such a set, each
$-\log_2(q_i)$ is the number of bits required to encode each $i$, and the
event $i$ occurs with frequency $q_i$, so $-q_i \log_2(q_i)$ is the
relative cost in terms of bits required to encode event $i$. If the
message, or set, is highly random, it takes a lot of bits to encode the
message. High randomness corresponds to a high average level of surprise
per event, which means that we have relatively little information.

Note that information is the opposite of randomness and entropy. The
measurement of information can be expressed as the negative entropy,
$-H$.
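A short Python sketch of the entropy sum in Eq. (15), using base 2 so that the result is in bits. The function name and the example distributions are assumptions for illustration only.

```python
import math


def shannon_entropy_bits(q):
    """H(q) = -sum_i q_i * log2(q_i); outcomes with q_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in q if p > 0)


fair_coin = shannon_entropy_bits([0.5, 0.5])      # maximal randomness for two outcomes
biased_coin = shannon_entropy_bits([0.9, 0.1])    # less random, so more information
uniform_32 = shannon_entropy_bits([1 / 32] * 32)  # 32 equally likely outcomes
certain = shannon_entropy_bits([1.0, 0.0])        # no randomness at all
```

The fair coin is the most random of the two-outcome cases, and the certain outcome has no randomness, which matches the reading of $-H$ as information.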



The information in a comparison 

The problem with $-H$ as a measure of information is that, by itself, it
does not give a sense of comparison or information gain. In the
statistical example, we compared two distributions and the information
gained to discriminate between those distributions provided by an
observation. In terms of selection, we will be concerned with the
information gain by a population before and after evolutionary change,
requiring a comparison between the initial and updated probability
distributions that describe the population before and after selection.

In a comparison, one way to measure a gain in informa- 
tion is by the reduction in the number of bits required to 
encode, or to predict, the distribution of outcomes in one 
population relative to another. A reduced number of bits 
corresponds to reduced randomness, and reduced ran- 
domness corresponds to improved prediction and more 
information. Thus, we can measure information gain by 
the reduction in the number of bits. 

To make comparisons, we need an expanded definition of entropy

$$H(r, p) = -\sum_i r_i \log_2(p_i), \tag{16}$$



where $H(r, p)$ is the entropy in the probability distribution $r$ when
encoded by the associated probabilities $p$. This expression may be
interpreted by thinking of the different $i$ values as symbols in an
alphabet, the $r_i$ as the frequency of the symbols in a message, and the
$p_i$ as the frequencies used to determine the encoding of the symbols
$i$. Then $H(r, p)$ is the average number of bits required to encode a
message $r$ in a code based on $p$.
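The expanded definition in Eq. (16) can be sketched in a few lines of Python. The function name and the example frequencies are hypothetical; the example shows that a code matched to the message frequencies needs fewer bits on average than a mismatched code.

```python
import math


def cross_entropy_bits(r, p):
    """H(r, p) = -sum_i r_i * log2(p_i): average bits to encode r with a code based on p."""
    return -sum(ri * math.log2(pi) for ri, pi in zip(r, p) if ri > 0)


r = [0.5, 0.25, 0.25]   # hypothetical symbol frequencies in the message
p = [0.25, 0.25, 0.5]   # hypothetical frequencies used to build the code

matched = cross_entropy_bits(r, r)     # code built from the message itself
mismatched = cross_entropy_bits(r, p)  # code built from the wrong frequencies
```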

To compare populations, suppose an updated population has probabilities
of types (events) $q'_i$, and entropy $H(q', q') = H(q')$. By contrast,
the entropy of the new population, when using the encoding of the old
population, $q$, before new information was acquired, is $H(q', q)$,
which is the randomness in the new population when encoded by the old
frequencies.

In the updated population, the change in information obtained from the
updated encoding is the average number of bits to encode $q'$ based on
the new frequencies, $H(q', q')$, minus the average number of bits to
encode $q'$ based on the old frequencies, $H(q', q)$, which is

$$
\begin{aligned}
-\left(H(q', q') - H(q', q)\right)
  &= \sum_i q'_i \log_2(q'_i) - \sum_i q'_i \log_2(q_i) \\
  &= \sum_i q'_i \log_2\left(\frac{q'_i}{q_i}\right) \\
  &= \mathcal{D}(q' \| q),
\end{aligned}
\tag{17}
$$



where the initial minus sign is used to express negative entropy, which
is information. The term $\log_2(q'_i/q_i)$ is the number of extra bits
to encode $q'_i$ given a prior assumption that event $i$ happens with
probability $q_i$. The expression $\mathcal{D}$ measures the average
number of extra bits needed when encoding the new population by the old
frequencies rather than with the new, updated frequencies. Thus,
$\mathcal{D}$ is the average gain in information in a population update
when measured in terms of number of bits. A value of $\mathcal{D} = 2$
means that an efficiency gain of two bits has been achieved by the extra
information provided. Alternatively, we may say that the new information
enhances predictability, such that the remaining randomness, or
unpredictability, has been reduced by two bits.
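The bit-counting argument can be checked numerically. In the sketch below, the function names and the two example populations are assumptions chosen so that the arithmetic is exact; the gain is computed as the difference between the two encodings and compared with the direct sum of $q'_i \log_2(q'_i/q_i)$.

```python
import math


def cross_entropy_bits(r, p):
    """Average bits to encode distribution r with a code based on p."""
    return -sum(ri * math.log2(pi) for ri, pi in zip(r, p) if ri > 0)


def information_gain_bits(q_new, q_old):
    """Extra bits needed when q' is encoded by the old frequencies q."""
    return cross_entropy_bits(q_new, q_old) - cross_entropy_bits(q_new, q_new)


q_old = [0.25, 0.25, 0.25, 0.25]    # hypothetical population before the update
q_new = [0.5, 0.25, 0.125, 0.125]   # hypothetical population after the update

gain = information_gain_bits(q_new, q_old)
direct = sum(a * math.log2(a / b) for a, b in zip(q_new, q_old))
```

Here the old code needs 2 bits per type on average and the updated code needs 1.75 bits, so the update gains a quarter of a bit.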



SELECTION AND THE MEANING OF 
INFORMATION 

The encoding interpretation of information is well known and widely
accepted [3, 5]. By contrast, a formal interpretation of natural
selection in terms of information has never been developed in a simple,
clear, and widely agreed manner. Here, I give my interpretation of
natural selection and information.



Why $J$ rather than $\mathcal{D}$?

To analyze the meaning of information with regard to natural selection,
we must begin with the fundamental expression of selection in terms of
information divergence given in Eq. (13) as $\Delta_s\bar{m} = J$. That
expression states that the change in mean log fitness is the Jeffreys
divergence, $J$. Recall the definition of $J$ from Eq. (12) as

$$J(q', q) = \mathcal{D}(q' \| q) + \mathcal{D}(q \| q').$$

In most statistical and physical applications, measures of divergence
and information typically use $\mathcal{D}$ [3]. For example, Bayesian
updating can often be expressed in terms of a prior distribution, $q$, an
updated distribution based on new data, $q'$, and the divergence of the
updated distribution from the prior, $\mathcal{D}(q' \| q)$. In the
Bayesian expression, $\mathcal{D}$ describes the gain in information
measured in terms of bits and interpreted with regard to the efficiency
of encoding information or, equivalently, the reduced randomness and
increased predictability of outcomes.

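The measure $\mathcal{D}$ is asymmetric, because
$\mathcal{D}(q' \| q) \neq \mathcal{D}(q \| q')$. By contrast, $J$ is
symmetric, because it is the sum of the divergence in each direction. The
symmetry in the selection equation arises because, from the earlier
expression for the change in mean log fitness, we have

$$
\begin{aligned}
\Delta_s\bar{m} &= \sum_i \Delta q_i \left[\log(q'_i) - \log(q_i)\right] \\
                &= \sum_i \Delta q_i \left[\Delta\log(q_i)\right].
\end{aligned}
\tag{18}
$$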



If we switch $q'_i$ and $q_i$, then $\Delta q_i$ changes sign, and
$\Delta\log(q_i)$ also changes sign. The two sign changes cancel. Thus,
we obtain the same information gain when selection moves a population as
$q \to q'$ or in the reverse direction as $q' \to q$.
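The symmetry argument can be verified with a minimal Python sketch. The function names and the example frequencies are hypothetical. The sketch compares the asymmetric divergence in each direction, their sum $J$, and the selection form $\sum_i \Delta q_i \, \Delta\log(q_i)$.

```python
import math


def kl(a, b):
    """Divergence D(a || b) in nats for strictly positive distributions."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))


def jeffreys(q_new, q_old):
    """J(q', q) = D(q' || q) + D(q || q')."""
    return kl(q_new, q_old) + kl(q_old, q_new)


def selection_form(q_new, q_old):
    """sum_i (q'_i - q_i) * (log q'_i - log q_i), the form in Eq. (18)."""
    return sum((a - b) * (math.log(a) - math.log(b))
               for a, b in zip(q_new, q_old))


q_old = [0.5, 0.3, 0.2]         # hypothetical population before selection
q_new = [5 / 12, 0.5, 1 / 12]   # hypothetical population after selection

forward = kl(q_new, q_old)      # D(q' || q)
backward = kl(q_old, q_new)     # D(q || q'), generally a different number
j_forward = jeffreys(q_new, q_old)
j_backward = jeffreys(q_old, q_new)
j_selection = selection_form(q_new, q_old)
```

Swapping the two populations changes each one-directional divergence but leaves both $J$ and the selection form unchanged.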
Fitness in terms of encoded information

The information expression for fitness in Eq. (18) is in terms of
$\log(q'_i/q_i)$. Thus, the information gain continues to be about
efficiency of encoding or, equivalently, the reduced randomness and
increased predictability of outcomes. We could, for example, think of an
increase in mean log fitness as an increase in the population's
prediction of, or match to, the state of nature: the fit of the
population to the environmental challenge.

This interpretation of fitness in terms of encoding is universal, in the
sense that the particular environmental challenges and the particular
meaning of the gain in fitness with respect to particular characters do
not enter into the expressions. The universal expression of fitness and
selection in terms of probabilities and encoding yields the match between
changes in mean log fitness and changes in the classical expressions of
information.



Encoding versus meaning

The great power and universality of the classic theory of information
arises because it does not depend on meaning. Information is formulated
strictly in terms of encoding, bits, randomness, and predictability,
independently of what is being encoded or predicted. Fitness obtains the
same universality, because fitness uses the same expressions of relative
frequency as the classic information measures. That universality for
fitness makes sense, because fitness is a general expression for the way
in which populations accumulate information, independent of the
characters and environmental challenges that distinguish particular
cases.

Although it is certainly beneficial to have a universal expression of
fitness in terms of information, we pay for that universality by the
limited scope of fitness expressed only in terms of encoding. Information
is about predictability, and predictability is always predictability
about something. Natural selection must, in some way, be about the
increased information with respect to the environmental challenges that
shape success. How can we bring this particular meaning of the
information about environmental challenges into the formulation of
fitness?

There is perhaps no universal way to express meaning with respect to
information. That may be why the encoding interpretation has been so
valuable. The following sections explore two related ways in which to
bring meaning into the information interpretation of fitness. The next
section develops the notion of Fisher information. Later sections present
the idea of a coordinate system for information and evolutionary change,
a connection between the Price equation and information.



NATURAL SELECTION AND FISHER INFORMATION

Shannon information is not really information
as such, but rather the capacity to transmit
information, whereas Fisher information is
truly a measure of informativeness about
something specific, the value of a parameter.
Shannon's refers to the medium, Fisher's to
the message [44, p. 6].

We have been working on the scale of encoded information. That scale
depends only on probability distributions, without any explicit
connection to what sort of events or meaning attach to the probabilities.
Units of encoded information can be measured in terms of bits. The
following extends [2].

One way to interpret meaning is to change the scale. 
Suppose we could relate bits of encoded information to 
a new scale on which we interpret meaning. To relate 
the change in information to the change in meaning, we 
could evaluate 

$$\Delta\text{information} = \left(\frac{\Delta\text{information}}{\Delta\text{meaning}}\right) \Delta\text{meaning}. \tag{19}$$

The relation is trivial when expressed in this way. How- 
ever, we can see that the ratio of change in information 
to change in meaning provides the translation between 
the two scales. 

To make this expression for the relations between the 
scales useful, we must connect each of the terms to our 
prior discussion of information and to a new way of de- 
scribing meaning. That connection leads us to expres- 
sions of natural selection in terms of the fit of characters 
to the environment, rather than the efficiency of encoding 
information in terms of bits. 

Up to this point, I have been writing $q_i$ or $q'_i$ for the probability
of event $i$, whatever sort of event or characteristic $i$ may be. The
probability distribution is the set of $q_i$ values over the range of
possible characters, each possible character associated with a label $i$.
In this formulation, one can think of the probability distributions as
interpreted nonparametrically, in the sense that we work directly with
the actual distribution of probabilities without reference to any
underlying parameters or causes.

Now suppose we associate a set of values, $\theta$, with each probability
distribution [65]. We could think of $\theta$ as a parameter, for
example, the mean of the distribution. Or we could think of $\theta$ as
the predictions about the environment associated with a probability
distribution. The predictions might be expressed as characters. The
quality of the predictions could be associated with fitness.

For now, we take $\theta$ in the general sense of some values associated
with a distribution. To express the association, we expand our notation
for probabilities to write $q_i|\theta$, the probability of event $i$
given the associated value $\theta$. An updated population may have a new
associated value, $\theta'$, such as a new mean or a new prediction about
the environment, so we write $q'_i|\theta'$. The change in probability is
now expressed as

$$\Delta q_i|\theta = q'_i|\theta' - q_i|\theta.$$

To express the scaling of probability changes relative to changes on the
new $\theta$ scale, we can divide both sides by the change on the
$\theta$ scale, yielding

$$\frac{\Delta q_i|\theta}{\Delta\theta} = \frac{q'_i|\theta' - q_i|\theta}{\theta' - \theta}.$$

This expression gives us a way to match changes on the scale of meaning,
$\theta$, to changes on the scale of probability and encoded information,
$q$.



We can now follow Eq. (19) to express the change in information as the
change on the scale of meaning multiplied by the change of information
scaled relative to the change in meaning. To develop this expression, we
must continue to match our previous work on information and selection to
the new notation in relation to meaning.

The log-likelihood ratio, $\log(q'_i/q_i)$, can be written as
$\log(q'_i) - \log(q_i)$, which may be abbreviated as $\Delta\log(q_i)$,
as in Eq. (18). This difference of logarithms expresses the change in the
number of bits required to encode the probabilities associated with $i$
(as described below Eq. (17)). If we now express probabilities in
relation to $\theta$, as $q|\theta$, and divide by $\Delta\theta$, we
obtain the change in the number of bits in relation to the change on our
scale of meaning

$$\frac{\log(q'_i|\theta') - \log(q_i|\theta)}{\Delta\theta} = \frac{\Delta\log(q_i|\theta)}{\Delta\theta}.$$



We can now put the pieces together by relating these new expressions with
the expression in Eq. (18) for the change in mean log fitness, yielding a
form equivalent to the intuitive description in Eq. (19) as

$$\Delta_s\bar{m} = \frac{J(\theta)}{\Delta\theta^2}\,\Delta\theta^2, \tag{20}$$



in which I write $\Delta\theta^2 = (\Delta\theta)^2$ for the square of
the change in the parameter, and the term $J(\theta)$ is the Jeffreys
divergence, which is now a function of the scale of meaning, $\theta$,
and is written as

$$J(\theta) = \sum_i \left(\Delta q_i|\theta\right)\left[\Delta\log(q_i|\theta)\right]. \tag{21}$$



These expressions simply repeat our prior derivation of
$\Delta_s\bar{m} = J$, but with explicit consideration of $\theta$.

As the changes become small, $\Delta\theta \to 0$, the Jeffreys
divergence, $J(\theta)$, divided by the squared change in scale,
$\Delta\theta^2$, converges to the important quantity in statistical
theory known as Fisher information, $F(\theta)$, which we write as

$$\frac{J(\theta)}{\Delta\theta^2} \to F(\theta),$$

as shown in Appendix A. Thus, for small changes on the scale of meaning,
$\Delta\theta \to 0$, we may write the change in average log fitness as

$$\Delta_s\bar{m} \to F(\theta)\,\Delta\theta^2. \tag{22}$$



This derivation provides a more general way to arrive at my earlier
statement that changes in mean fitness are proportional to Fisher
information [2]. Fisher information is the information in an observation
about a parameter, or a set of parameters. In our case, $\theta$
represents the parameters, which is our scale of meaning.

One can also think of Fisher information as the Jeffreys divergence
between populations, $J(\theta)$, relative to the squared divergence on
the scale of meaning, $\Delta\theta^2$. Thus, Fisher information is the
sensitivity of change in the encoded information in populations,
$J(\theta)$, relative to change on the parametric scale of meaning. The
greater the sensitivity, the more information in an observation with
respect to the divergence between populations on the underlying
parametric scale. See Appendix B for ways in which Fisher information has
been used in previous models of selection.
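The convergence of $J(\theta)/\Delta\theta^2$ to Fisher information can be illustrated numerically. The sketch below uses a Poisson distribution with mean $\theta$, which is an assumption made only for illustration and is not an example from the text; the Fisher information about the mean of a Poisson distribution is $1/\theta$. The function names and the truncation point `k_max` are also hypothetical.

```python
import math


def log_poisson(k, theta):
    """Log probability of k under a Poisson distribution with mean theta."""
    return -theta + k * math.log(theta) - math.lgamma(k + 1)


def jeffreys_poisson(theta, theta_new, k_max=200):
    """J = sum_k (q'_k - q_k) * (log q'_k - log q_k), truncated at k_max."""
    total = 0.0
    for k in range(k_max + 1):
        lp_old = log_poisson(k, theta)
        lp_new = log_poisson(k, theta_new)
        total += (math.exp(lp_new) - math.exp(lp_old)) * (lp_new - lp_old)
    return total


def scaled_divergence(theta, d_theta):
    """J(theta) / d_theta^2, which should approach F(theta) = 1 / theta."""
    return jeffreys_poisson(theta, theta + d_theta) / d_theta ** 2


theta = 3.0
fisher = 1.0 / theta
coarse = scaled_divergence(theta, 0.5)    # large step: noticeably off
fine = scaled_divergence(theta, 0.01)     # smaller step: closer
finest = scaled_divergence(theta, 0.001)  # smaller still: closer again
```

As the step on the parametric scale shrinks, the scaled divergence approaches $1/\theta$, in line with the limit stated above.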



PARAMETRIC COORDINATES FOR 
SELECTION AND INFORMATION 

The change in mean log fitness measures the amount 
of information that the population accumulates by se- 
lection. Because fitness describes changes in relative fre- 
quencies, fitness concerns encoding of information, which 
can be measured in numbers of bits. 

The previous section showed how to convert from bits to an alternative
scaling of information in terms of $\theta$. We may interpret the
parameters $\theta$ as a scale that has meaning with respect to the fit
of the population's characteristics to the environment. This section
further analyzes the notion of parametric coordinates for selection and
information, followed by an example.



Parametric coordinates and Fisher information 



From Eq. (20), the key result for the change in mean log fitness in terms
of a parametric scale can be rewritten as

$$\frac{\Delta_s\bar{m}}{\Delta\theta^2} = \frac{J(\theta)}{\Delta\theta^2} \to F(\theta). \tag{23}$$



Change in mean log fitness is the amount of information gained by
selection. The ratio $\Delta_s\bar{m}/\Delta\theta^2$ is the change in
information per unit change in squared distance on the parametric scale.
Because we consider the parametric scale as the scale of meaning, this
ratio is the change in information relative to the change in squared
distance on the scale of meaning [65]. The arrow on the right-hand side
states that the relative change in information per unit of squared
parametric distance is the Fisher information in an observation about the
parameter, $\theta$.

The interpretation of "observation" with respect to natural selection is
interesting. Each interaction of an individual with the environment leads
to a realized fitness. That realized individual fitness is an
observation, by the population, of the fit between certain
characteristics and the environment. For a particular type, $i$, the
average information in each observed individual fitness is
$\log\left(\frac{q'_i|\theta'}{q_i|\theta}\right) = \Delta\log(q_i|\theta)$.
Thus, the ratio $\Delta\log(q_i|\theta)/\Delta\theta$ is the change, or
sensitivity, of information in an observation relative to a change in
$\theta$. To get the average over all types, $i$, we weight this
information per type by $q_i|\theta$. To analyze selection, we need the
change in frequencies, or sensitivity of those changes, relative to
changes in $\theta$, which is $\Delta q_i|\theta/\Delta\theta$. Combining
these terms yields $J(\theta)/\Delta\theta^2 \to F(\theta)$.



Change in the mean or variance of a character

A few examples clarify the abstract expressions for information. To keep
things simple, I assume small changes so that we can use the Fisher
information simplification in Eq. (23). With larger changes, we could
make exact calculations using $J(\theta)$ instead of Fisher information.



Change in the mean of a normal distribution under directional selection

Suppose the character values in a population, $z_i$, follow a normal
distribution with mean, $\mu$, and variance, $v$. An observation from
that population provides information about the mean of the population. It
is well known that an observation from a normal population provides
Fisher information about the mean of $F(\mu) = 1/v$. The more variable
the population, the larger $v$, and the less information in an
observation about the average value. Put another way, the precision in
measurement is proportional to $1/v$. More variable populations yield
less precise measurements, and thus less information per observation
about the average value.

We interpret natural selection as obtaining information through the
observed fitnesses associated with character values. Suppose that the
population retains a normal shape and a fixed variance before and after
selection, and changes only in its mean value. Then the change in the
mean, $\Delta\mu$, is sufficient to describe the effects of selection.
From Eq. (22), the increase in information by natural selection is

$$\Delta_s\bar{m} = F(\mu)\,\Delta\mu^2 = \frac{\Delta\mu^2}{v}.$$

This expression provides the relation between the change in information,
$\Delta_s\bar{m}$, which is a universal abstract quantity about encoding,
and the scaling of the character that gives meaning for this particular
case, $\Delta\mu^2/v$.



Change in the variance of a normal distribution under stabilizing
selection

The previous example described directional selection on the average trait
value, holding the variance constant. This section considers stabilizing
selection. In this case, the population begins with its center at the
optimum. Selection reduces the variance, but leaves the mean unchanged.
For a normal distribution, the Fisher information in an observation about
the variance, $v$, is $1/(2v^2)$. Thus,

$$\Delta_s\bar{m} = F(v)\,\Delta v^2 = \frac{\Delta v^2}{2v^2},$$

which is the gain in information when stabilizing selection reduces the
variance of a normally distributed character.



Change in the mean of an exponential distribution

Suppose the character follows an exponential distribution before and
after selection. An observation from an exponential population provides
Fisher information of $1/v$ about the mean, $\mu$. The variance of an
exponential distribution is $\mu^2$. The change in information by
selection is

$$\Delta_s\bar{m} = F(\mu)\,\Delta\mu^2 = \frac{\Delta\mu^2}{v},$$

which matches the case of the normal distribution. However, the variance
of the exponential distribution changes with the mean. By contrast, the
normal distribution has a separate parameter for the variance, which we
held constant by assumption.
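The small-change expressions above can be compared with exact values of the Jeffreys divergence. The closed forms in the sketch below are not given in the text; they follow from the standard Kullback-Leibler divergences for normal and exponential distributions and should be read as assumptions of this illustration, as should the function names and numerical values. Natural logarithms are used throughout.

```python
def jeffreys_normal_mean(mu, mu_new, v):
    """Exact J between N(mu, v) and N(mu_new, v): (delta mu)^2 / v."""
    return (mu_new - mu) ** 2 / v


def jeffreys_normal_variance(v, v_new):
    """Exact J between N(mu, v) and N(mu, v_new): (delta v)^2 / (2 v v')."""
    return (v_new - v) ** 2 / (2 * v * v_new)


def jeffreys_exponential_mean(mu, mu_new):
    """Exact J between exponentials with means mu and mu_new: (delta mu)^2 / (mu mu')."""
    return (mu_new - mu) ** 2 / (mu * mu_new)


def fisher_approx(fisher_info, delta):
    """Small-change approximation F(theta) * (delta theta)^2."""
    return fisher_info * delta ** 2


# Directional selection on a normal mean: the approximation is exact here.
exact_mean = jeffreys_normal_mean(0.0, 0.1, 2.0)
approx_mean = fisher_approx(1 / 2.0, 0.1)

# Stabilizing selection reducing a normal variance from 1.0 to 0.98.
exact_var = jeffreys_normal_variance(1.0, 0.98)
approx_var = fisher_approx(1 / (2 * 1.0 ** 2), 0.02)

# Exponential distribution with mean moving from 2.0 to 2.02.
exact_exp = jeffreys_exponential_mean(2.0, 2.02)
approx_exp = fisher_approx(1 / 2.0 ** 2, 0.02)
```

For the normal mean with fixed variance the two values coincide; in the other two cases they differ by a few percent at these step sizes and agree as the change shrinks.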



Change in allele frequency 

Suppose $q_1 = p$ is the frequency of a particular allele, and
$q_2 = 1 - p$ is the frequency of the alternative allele. The
distribution of allele frequencies is binomial with a single observation.
The mean allelic value is $\mu = p$, and the variance is $v = p(1 - p)$.
The Fisher information in an observation about the mean of a binomial
population is $1/v$. The change in information by selection is

$$\Delta_s\bar{m} = F(\mu)\,\Delta\mu^2 = \frac{\Delta\mu^2}{v}.$$

Using $p$ for gene frequency to match the familiar notation of population
genetics,

$$\Delta_s\bar{m} = F(p)\,\Delta p^2 = \frac{\Delta p^2}{p(1 - p)},$$

which holds when $\Delta\mu = \Delta p$ is small. For larger changes, we
can obtain an exact expression by using the Jeffreys divergence rather
than the Fisher information, as in Eq. (23).
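A minimal Python sketch compares the exact Jeffreys divergence for a change in allele frequency with the Fisher information approximation. The function names and the example frequencies are hypothetical. For two alleles, the sum $\sum_i \Delta q_i \, \Delta\log(q_i)$ reduces to $\Delta p$ multiplied by the change in the log odds.

```python
import math


def jeffreys_allele(p, p_new):
    """Exact J for two alleles: sum over both alleles of delta q * delta log q."""
    d = p_new - p
    return d * (math.log(p_new / p) - math.log((1 - p_new) / (1 - p)))


def fisher_allele(p, p_new):
    """Small-change approximation: (delta p)^2 / (p * (1 - p))."""
    return (p_new - p) ** 2 / (p * (1 - p))


small_exact = jeffreys_allele(0.2, 0.21)   # small change in frequency
small_approx = fisher_allele(0.2, 0.21)

large_exact = jeffreys_allele(0.2, 0.6)    # large change in frequency
large_approx = fisher_allele(0.2, 0.6)
```

For the small change the two values agree to within a few percent, whereas for the large change the approximation is visibly off and the exact divergence is the appropriate measure.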
CHARACTER COORDINATES AND SELECTION

The previous section assumed that the parameters, $\theta$, summarize all
differences in the frequency distributions before and after selection. We
can think of $\theta$ as defining the coordinate system for evolutionary
change. The reduction of frequencies to a parametric description, such as
the mean of the distribution, typically requires character values to be
associated with the $i$ values. By convention, we use $z_i$ for character
values. Thus, if changes in the mean are sufficient to describe changes
in the probability distribution of characters in the population before
and after selection, then $\mu = \bar{z} = \sum_i q_i z_i$ is a reduction
of the full distribution of character values to a single parametric
dimension.



Parametric character coordinates 

Let us review the use of parametric coordinates before discussing
nonparametric coordinates. In a parametric example, suppose that
frequencies before and after selection are normally distributed, with
parameters $(\mu, v)$ for the mean and the variance. Selection moves the
population from the initial location, defined by the parameters
$(\mu, v)$, to the location after selection, $(\mu', v')$. The two
parametric dimensions provide a complete description of change by
selection. If we hold one parameter constant, such as the variance, and
only allow the mean to change, then change in the single parametric
dimension from $\mu$ to $\mu'$ fully describes the population before and
after selection.

Parametric expressions describe the total change in information by

$$\Delta_s\bar{m} = \frac{J(\theta)}{\Delta\theta^2}\,\Delta\theta^2 \to F(\theta)\,\Delta\theta^2.$$

For example, let the parameter be the mean, $\theta = \mu$. The term
$J(\mu)/\Delta\mu^2 \to F(\mu)$ reduces the change in the average
information per observation to the single dimension of $\mu$. If we
multiply the information per observation by the distance moved in the
parametric dimension, $\Delta\mu^2$, we obtain the total change in
information. Thus, the calculation for the change in information is done
along the single parametric dimension of $\mu$.

The parametric dimension of $\mu$ can be thought of as the coordinate
system in which we evaluate change by selection. Each change in position
along the coordinate of $\mu$ corresponds to changes by selection,
because $\mu$ is a sufficient description for the full frequency
distribution of character values. In general, when we can reduce the
description of frequency distributions to a sufficient set of parameters,
$\theta$, then those parameters form the coordinates in which we evaluate
changes by selection.



Nonparametric character coordinates 

We can think of our fundamental expression for selection

$$\Delta_s\bar{z} = \sum_i \Delta q_i z_i$$

as a nonparametric expression. Each term includes the actual frequencies
in the population. The calculation is done over the full dimensionality
of the frequency distribution.

The character values, $\{z_i\} = z_1, z_2, \ldots$, form a nonparametric
coordinate system. For the population frequencies, $\{q_i\}$, the point
$\{q_i z_i\}$ locates the population before selection, and the point
$\{q'_i z_i\}$ locates the population after selection. The movement of
the population caused by selection is given by $\{\Delta q_i z_i\}$.

The expression for the total change in information caused by selection is

$$\Delta_s\bar{m} = J = \sum_i \Delta q_i \,\Delta\log(q_i) = \sum_i \Delta q_i \log\left(\frac{q'_i}{q_i}\right).$$

Each frequency change, $\Delta q_i$, associates with the character
$z_i = \Delta\log(q_i)$, the change in information for the $i$th type.
This is a nonparametric expression, because the calculation is done over
the full frequency distribution.

Character coordinates and information 

The character values provide the coordinates of meaning in an analysis of
selection. We can derive the relations between information and the
coordinates of meaning by using the results of Eqs. (7) and (8). From
those equations, we obtain the relation between the change given the
coordinates of meaning, $\Delta_s\bar{z}$, and the change given the
coordinates of information, $\Delta_s\bar{m}$, as

$$\Delta_s\bar{z} = \left(\frac{\beta_{zw}}{\beta_{mw}}\right) \Delta_s\bar{m}. \tag{24}$$

The term $\beta_{zw}$ is the regression coefficient of the character
values, $z$, on the fitnesses, $w$. The term $\beta_{mw}$ is the
regression coefficient of the log fitnesses, $m$, on the fitnesses, $w$.
These regressions provide an exact expression for changing the
coordinates from information, $\Delta_s\bar{m}$, to characters,
$\Delta_s\bar{z}$. When the magnitudes of the changes are small,
$w \to m + 1$, thus

$$\Delta_s\bar{z} \to \beta_{zm}\,\Delta_s\bar{m}. \tag{25}$$

To repeat, it is important to recognize a regression coef- 
ficient as an exact expression for the change in scale as- 
sociated with a change in coordinates. The regression is 
sufficient when evaluating the consequences for a change 
in coordinates with respect to a change in mean value. 
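The change of coordinates in Eq. (24) can be checked numerically. The sketch below assumes that log fitness is $m_i = \log(w_i/\bar{w})$ and that the regression coefficients are computed with frequency-weighted covariances; the function names and the example frequencies, fitnesses, and character values are hypothetical.

```python
import math


def wmean(q, x):
    """Frequency-weighted mean of x."""
    return sum(qi * xi for qi, xi in zip(q, x))


def wcov(q, x, y):
    """Frequency-weighted covariance of x and y."""
    mx, my = wmean(q, x), wmean(q, y)
    return sum(qi * (xi - mx) * (yi - my) for qi, xi, yi in zip(q, x, y))


def change_of_coordinates(q, w, z):
    """Return (direct change in mean z, change predicted from the regressions)."""
    w_bar = wmean(q, w)
    q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]   # replicator update
    m = [math.log(wi / w_bar) for wi in w]              # log fitnesses
    ds_z = sum((a - b) * zi for a, b, zi in zip(q_new, q, z))
    ds_m = sum((a - b) * mi for a, b, mi in zip(q_new, q, m))
    beta_zw = wcov(q, z, w) / wcov(q, w, w)             # regression of z on w
    beta_mw = wcov(q, m, w) / wcov(q, w, w)             # regression of m on w
    return ds_z, (beta_zw / beta_mw) * ds_m


q = [0.2, 0.5, 0.3]   # hypothetical frequencies
w = [0.9, 1.0, 1.2]   # hypothetical fitnesses
z = [1.0, 2.0, 4.0]   # hypothetical character values

direct, predicted = change_of_coordinates(q, w, z)
```

The direct sum over frequency changes and the value rescaled from the information coordinates agree to rounding error, which is the sense in which the regression ratio is an exact change of scale.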

The underlying values, $z_i$, may themselves be nonlinear functions of
other values [39]. For example, $z_i$ could be the product of different
character values measured on each individual, or the square of some
underlying character. What matters is that we average over the $z_i$
values to get $\Delta_s\bar{z}$.

CHARACTER COORDINATES AND TOTAL 
EVOLUTIONARY CHANGE 

The previous analyses have focused on the selection part of total
evolutionary change. I defined selection as the change caused by
frequency differences

$$\Delta_s\bar{z} = \sum_i \Delta q_i z_i.$$

The subscript $s$ emphasizes that this expression is the partial change
caused by selection [35–38].



any environmental or extrinsic factors that may change, altering the fit
of the characters to the environment. The changes in the frequencies
themselves can be an "environmental" change that alters fitnesses
[35–38]. Thus, no general expression for total evolutionary change in
fitness is possible other than

$$\Delta\bar{m} = J + \Delta_c\bar{m}.$$

One can, of course, analyze particular models such as mutation-selection
balance. Mutation decays information through changes in fitness that are,
on average, negative, causing a loss of information through the term
$\Delta_c\bar{m} = \sum_i q'_i \Delta m_i$. The particular loss of
information through $\Delta_c\bar{m}$ depends on the specific
assumptions. By contrast, the gain in information through selection is
always $\Delta_s\bar{m} = J$.



Total change in characters 



Equilibrium balance between information gain and 

loss 



The partial change arises by holding constant the char- 
acter values, such that Azi = z[ — Zi = 0. This assump- 
tion fixes the coordinates, z il and evaluates the meaning 
of changing frequencies in the context of that fixed set of 
coordinates. 

If the coordinates that give meaning also change, 
Azi ^ 0, then we must account for that change in co- 
ordinates with respect to the total evolutionary change. 
In particular, the total change is the sum of the change, 
A s , caused by selection through varying frequencies, q, 
holding constant the coordinates, z, plus the change in 
coordinates, A c , holding constant the new frequencies in 
the updated population, q' . We write the total change as 



Az = A s z + A c z 



(26) 



This expression is a form of the Price equation. I devoted 
the prior article to a full discussion of this equation |39j . 
Here, I focus only on those aspects that concern infor- 
mation. In particular, I emphasize the interpretation of 
z as a coordinate system that gives meaning to the infor- 
mation basis of natural selection. 



Total change in information 



The total evolutionary change in Eq. ( 26 1 can be used 
to evaluate information. Let z = m, where the log fitness, 
m, provides a measure of the information accumulated by 
a population . Thus, 



Am = A s m + A c m. 



(27) 



From Eq. (13), the selection component of change is 
A s fh — J. In general, no simplified reduction or par- 
ticular interpretation is possible for the change in coor- 
dinates, A c m. That change in coordinates arises from 



Many processes lead to an equilibrium balance between 
gain of information by selection and decay of information 
by an opposing force [55] • Mutation-selection balance is 
one example. Frequency-dependent selection is another, 
in which the gain in information by selection is balanced 
by the decay of information (fitness) caused by frequency 
changes. For example, in the evolution of sex ratios, mak- 
ing more daughters may be favored by selection. But as 
the number of daughters increases by selection, the ad- 
vantage of making extra daughters decays. 

Although we cannot, in general, specify the change 
in the coordinate term, A c m, we can express the equi- 
librium condition, Am = 0. Under a balance between 
information gain by selection and information decay by 
change in coordinates, 

J = — A c m. 

It is sometimes possible to analyze particular problems by 
using that universal expression for the balance of forces 

SUET]. 
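
The partition of total change into a selection part and a coordinate part can be illustrated with a small numerical sketch. All values below are hypothetical; the only assumption is the partition itself, with selection evaluated at the original character values and the coordinate change evaluated at the updated frequencies.

```python
# Hypothetical frequencies and character values, for illustration only
q     = [0.2, 0.5, 0.3]   # frequencies before selection, q_i
q_new = [0.1, 0.5, 0.4]   # frequencies after selection, q_i'
z     = [1.0, 2.0, 3.0]   # character values before, z_i
z_new = [1.1, 1.9, 2.8]   # character values in the updated population, z_i'

z_bar     = sum(qi * zi for qi, zi in zip(q, z))
z_bar_new = sum(qi * zi for qi, zi in zip(q_new, z_new))
total = z_bar_new - z_bar                                  # total change in the mean

# Selection part: frequencies change, coordinates z held constant
ds = sum((qn - qi) * zi for qn, qi, zi in zip(q_new, q, z))

# Coordinate part: coordinates change, evaluated at the new frequencies q'
dc = sum(qn * (zn - zi) for qn, zn, zi in zip(q_new, z_new, z))

print(total, ds + dc)
```

The two printed numbers agree, because the cross terms in $\sum_i q_i' z_i$ cancel when the two parts are added. Setting $z = m$ gives the same partition for information, in which a population at equilibrium has `ds` and `dc` equal in magnitude and opposite in sign.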



Evolution of the coordinate system 

In the previous sections, I have fixed the particular
dimensions that define the coordinate system. Although
the coordinates may change, $\Delta z_i$, each dimension $i$
remained. From a broader perspective, the evolution of
the various dimensions in the coordinate system itself is
perhaps among the most interesting evolutionary problems.
One aspect concerns the origin of new characters
[68]. More generally, one may consider the evolution of
the optimal set of characters with respect to the capture
of information.

There is an interesting literature in engineering about
optimal design of sensors with respect to capturing
information. That literature sometimes uses Fisher
information as the optimality criterion with respect to
design. Application of that design perspective with
regard to information may provide insight into biological
problems. For example, multiple cellular receptors
may respond to the same sort of information, such as the
concentration of a hormone. But those receptors may be
tuned differently with regard to sensitivity to signals. A
related idea concerns the common tradeoff between
informativeness and simplicity in classification [70].

A second aspect of coordinates concerns the parametric 
reduction of the full nonparametric distribution of char- 
acters. Reducing the full distribution to the mean is an 
extreme reduction, and probably not justified in general. 
However, there often may be some suitable reduction of 
dimensionality to a sufficient set of parameters with respect
to the acquisition of information [71, 72]. That
sufficient set defines the coordinates of information and 
meaning followed by an evolving population. It may be 
that an improved parametric representation of informa- 
tion in the environment by a set of characters enhances 
fitness. Thus, it may be the parametric representation 
itself that is under the strongest selection or, at least, a 
particularly interesting form of selection. 

DISCUSSION 

The fundamental equations of selection are often writ- 
ten in the statistical terms of variances, covariances, and 
regressions. I have argued that one obtains a deeper un- 
derstanding of selection if one learns to read the funda- 
mental equations in terms of information. Here, I review 
my argument by listing the key steps derived in previous 
sections. I start with the classic statistical equations of 
selection. I then show the connection of those statistical 
expressions of selection to expressions for the information 
that populations accumulate about the fit of characters 
to the environment. 



Statistical expressions of selection 

To understand where the classic statistical expressions 
of selection come from and what they mean, let us start 
with the basic equation for evolutionary change by natural
selection

$$\Delta_s\bar{z} = \sum_i \Delta q_i\, z_i$$

given earlier. Here, $\Delta_s\bar{z}$ is the change caused by
selection in the average value of a character, $z$. This
expression applies generally to selection of any value. For
example, $z$ could be gene frequency, leading to population
genetics expressions, or $z$ could be a quantitative trait
such as weight, or $z$ could be a nonlinear function of
several characters. The $\Delta q_i$ terms are the changes caused
by selection in the frequency of the $i$th character value,
$z_i$. Total selection is the total change in frequencies, with
each change caused by selection, $\Delta q_i$, weighted by its
associated character value, $z_i$.

I showed that one can rewrite the association between
the change caused by selection and the character value
as

$$\Delta_s\bar{z} = \mathrm{Cov}(w, z)/\bar{w}, \tag{28}$$

a form known as the Price equation and also related to
Robertson's secondary theorem of natural selection [39].
This form provides the foundation for quantitative genetics
theory, and also arises in standard models of population
genetics. Because the regression coefficient is
$\beta_{zw} = \mathrm{Cov}(w, z)/V_w$, we can rewrite the covariance
as the product of a regression coefficient and a variance term

$$\Delta_s\bar{z} = \mathrm{Cov}(w, z)/\bar{w} = \beta_{zw}V_w/\bar{w}, \tag{29}$$

where $\beta_{zw}$ is the regression of character value, $z$, on
fitness, $w$, and $V_w$ is the variance in fitness. These sorts
of regression and variance terms arise repeatedly in the
fundamental equations of selection.

One can easily understand why selection depends on 
an association between fitness, w, and character value, z. 
Those character values associated with higher fitness will 
increase, whereas those character values associated with 
lower fitness will decrease. But why should the expression
for selection be exactly the covariance, or the regression
multiplied by the variance, which capture only the linear
component of association? The reason is that $\Delta_s\bar{z}$
describes selection by a change in average values. To
calculate a change in the average, we need only the linear
component of association between character and fitness.
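
A short numerical sketch may make the point about the linear component concrete. It assumes the standard update $q_i' = q_i w_i/\bar{w}$ and uses hypothetical values in which the character is a strongly nonlinear function of fitness; the helper name `wmean` is illustrative only. The residuals of $z$ about its regression on $w$ contribute nothing to the change in the mean.

```python
def wmean(x, p):
    """Frequency-weighted mean of x."""
    return sum(pi * xi for pi, xi in zip(p, x))

# Hypothetical values; z is deliberately a nonlinear function of fitness
q = [0.1, 0.4, 0.3, 0.2]
w = [0.6, 0.9, 1.2, 1.5]
z = [(wi - 1.0) ** 2 for wi in w]

w_bar = wmean(w, q)
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]       # assumed update rule
dq = [qn - qi for qn, qi in zip(q_new, q)]

direct = sum(d * zi for d, zi in zip(dq, z))            # sum_i Delta q_i z_i

cov_wz = wmean([wi * zi for wi, zi in zip(w, z)], q) - w_bar * wmean(z, q)
var_w = wmean([wi * wi for wi in w], q) - w_bar ** 2
beta_zw = cov_wz / var_w

cov_form = cov_wz / w_bar                               # Cov(w, z) / w_bar
reg_form = beta_zw * var_w / w_bar                      # beta_zw V_w / w_bar

# Residuals of z about its linear regression on w
alpha = wmean(z, q) - beta_zw * w_bar
resid = [zi - (alpha + beta_zw * wi) for zi, wi in zip(z, w)]
resid_effect = sum(d * ri for d, ri in zip(dq, resid))  # contribution of the nonlinear part

print(direct, cov_form, reg_form, resid_effect)
```

The first three values coincide and the last is zero to rounding error, because the residuals are uncorrelated with fitness by construction of the regression.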

These statistical expressions of selection in terms of co- 
variances, variances, and regressions have been very use- 
ful throughout the history of evolutionary theory. How- 
ever, these expressions give no sense of what selection 
means. To say that selection is the covariance of fitness 
and character value is simply to express an algebraic re- 
lation. That algebraic relation is very useful, but it does 
not give a sense of what selection is actually doing with 
regard to adaptation or how selection relates to processes 
in other fields of study. The statistical expressions do not 
tell us how to read the fundamental equations of selection 
with regard to the meaning of the underlying process. 



Selection in terms of information 

In this article, I argued that selection causes popula- 
tions to accumulate information about the fit of charac- 
ters to the environment. I gave a precise definition of 
"information." That definition of information with re- 
spect to selection matches exactly the classic usage of 
information and entropy from the fundamental theories 
of physics, statistics, and communication. By showing 
the exact relations between selection and information, I 
tied the theory of natural selection to the broader conceptual
framing of problems at the foundation of many
key scientific disciplines.

I will not repeat the whole argument here. Instead,
I list a few steps to emphasize the essential points. To
understand the information associated with selection and
fitness, we must analyze fitness on a logarithmic scale

$$m_i = \log(w_i) = \log\left(\frac{q_i'}{q_i}\right).$$

The logarithmic scale compares relative magnitudes. We
need relative magnitudes because there is no meaning in
the number of babies or the number of copies produced
with regard to whether a type, $i$, is increasing or decreasing
in the population. We need to know the relative success.
The logarithmic scale is the natural scale of relative
magnitudes.

Using log fitness, $m$, as the character value of interest
in Eq. (28), we obtain

$$\Delta_s\bar{m} = \sum_i \Delta q_i\, m_i = \sum_i \Delta q_i \log\left(\frac{q_i'}{q_i}\right).$$

We recognize the fundamental expression for the change
in information given by the Kullback-Leibler divergence,
or relative entropy, as

$$\mathcal{D}(q'\|q) = \sum_i q_i' \log\left(\frac{q_i'}{q_i}\right).$$

Using this definition for change in information, $\mathcal{D}$, we can
express the change in mean log fitness caused by selection
as

$$\Delta_s\bar{m} = \mathcal{D}(q'\|q) + \mathcal{D}(q\|q').$$

This sum of the changes in information in each direction
is known as the Jeffreys divergence, $J$. Thus, we can
write the fundamental expression for the accumulation
in information by natural selection as

$$\Delta_s\bar{m} = J.$$

Because $z$ in Eq. (29) is just a placeholder for any character,
we can use $m$ in place of $z$ in that equation, yielding

$$\Delta_s\bar{m} = \beta_{mw}V_w/\bar{w}.$$

Thus, the information accumulated by natural selection
is equivalently expressed in terms of the regression
coefficient and variance

$$J = \beta_{mw}V_w/\bar{w}. \tag{30}$$



The value of $J$ is the gain in information. The variance
in fitness, $V_w$, is therefore a measure of the separation
between the initial population and the population after
selection, when the separation between populations
is expressed on a scale of information. The regression
divided by the mean fitness, $\beta_{mw}/\bar{w}$, is a scaling factor
that translates the measure of information in $V_w$ to the
scale of log fitness, $m$. That scaling change is required
because log fitness is the proper measure of information
in expressions of selection.

Eq. (30) shows the equivalence between the expression
of information gain and its expression in terms of
statistical quantities. There is nothing in the mathematics
to favor either an information interpretation or a
statistical interpretation.
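
The equivalence of the information and statistical readings can be verified numerically. The sketch below assumes the standard update $q_i' = q_i w_i/\bar{w}$ and uses hypothetical frequencies and fitnesses; the function name `kl` is illustrative only. It computes the Jeffreys divergence from the two directed Kullback-Leibler terms, the change in mean log fitness, and the regression-variance form.

```python
import math

def kl(a, b):
    """Kullback-Leibler divergence D(a || b) for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# Hypothetical values for illustration only
q = [0.2, 0.5, 0.3]   # initial frequencies
w = [0.5, 1.0, 1.6]   # fitnesses

w_bar = sum(qi * wi for qi, wi in zip(q, w))
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]        # assumed update rule
m = [math.log(wi) for wi in w]                           # log fitness

# Information reading: sum of the two directed divergences
J = kl(q_new, q) + kl(q, q_new)

# Change in mean log fitness caused by selection
ds_m = sum((qn - qi) * mi for qn, qi, mi in zip(q_new, q, m))

# Statistical reading: regression of m on w times variance in fitness
m_bar = sum(qi * mi for qi, mi in zip(q, m))
cov_wm = sum(qi * (wi - w_bar) * (mi - m_bar) for qi, wi, mi in zip(q, w, m))
var_w = sum(qi * (wi - w_bar) ** 2 for qi, wi in zip(q, w))
beta_mw = cov_wm / var_w
stat_form = beta_mw * var_w / w_bar

print(J, ds_m, stat_form)
```

All three quantities agree to rounding error. The two directed divergences differ from each other in general, so it is their sum, not either term alone, that matches the change in mean log fitness.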

I have argued that, when reading the fundamental 
equations of selection for meaning, we should prefer the 
information interpretation. The information perspective 
makes sense intuitively. Selection is the process by which 
populations accumulate information about the environ- 
ment. 

APPENDIX A: FISHER INFORMATION AS THE
LIMITING FORM OF THE JEFFREYS 
DIVERGENCE 

A large family of divergence measures converges to
Fisher information in the limit of small changes [65], [73]-[75].
In this appendix, I show that the limit of the Jeffreys
divergence is the Fisher information multiplied by a scaling
factor for parametric distance.

I also show that the chi-square divergence becomes the 
Fisher information metric in the limit of small changes. 
The different forms of divergence can be confusing if one 
does not realize that all of the different divergence mea- 
sures in the Fisher family are equivalent in the limit, but 
differ when changes are not small. 

My main point is that the Jeffreys divergence holds the 
unique position as the only correct divergence measure 
for models of selection. It is the only measure that is 
correct both for large changes and, in the limit, for small 
changes. As far as I know, my derivation in this article 
of the Jeffreys divergence in relation to selection has not 
been shown previously. The clear relation of the Jeffreys 
divergence to changes in information is essential to make 
the proper connection between selection and information. 



Limiting form of Jeffreys divergence 

I show $J(\theta) \to F(\theta)\Delta\theta^2$ as the distance in the
parametric coordinates $\Delta\theta^2 \to 0$. Notationally,
$\Delta\theta^2 = (\Delta\theta)^2$. Using the standard differential
notation for small differences, we write $\Delta\theta^2 \to d\theta^2$.
Thus, I show $J(\theta) \to F(\theta)\,d\theta^2$.

I use the vector $\theta$ as parametric coordinates for probability
distributions, following standard analysis in information
geometry [65]. For simplicity, I usually treat the
parametric vector as a single dimension. The extension
to multiple dimensions is standard.

The Jeffreys divergence in parametric form, from
the main text, is

$$J(\theta) = \sum_i (\Delta q_i|\theta)\,[\Delta\log(q_i|\theta)].$$

As the changes become small, $\Delta q_i|\theta = q_i|\theta' - q_i|\theta \to 0$
and $\Delta\theta = \theta' - \theta \to 0$, we write

$$\Delta q_i|\theta \to dq_i|\theta = \left(\frac{\partial q_i|\theta}{\partial\theta}\right)d\theta = \dot{q}_i\,d\theta,$$

where $\dot{q}_i$ is the derivative of $q_i|\theta$ with respect to $\theta$. Next,

$$\Delta\log(q_i|\theta) \to d\log(q_i|\theta) = \left(\frac{\partial\log(q_i|\theta)}{\partial\theta}\right)d\theta = \frac{\dot{q}_i}{q_i}\,d\theta,$$

where, to make the notation more concise, I use $q_i = q_i|\theta$.
Thus

$$J(\theta) \to \sum_i \left(\frac{\dot{q}_i^2}{q_i}\right)d\theta^2.$$

Below, I show that $\sum_i \dot{q}_i^2/q_i$ is Fisher information, $F(\theta)$.
Thus, $J(\theta) \to F(\theta)\,d\theta^2$.
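
The limit can be illustrated with a minimal numerical sketch. The one-parameter family used here, $q_i \propto \exp(\theta a_i)$ with arbitrary scores $a_i$, is a hypothetical choice made only because its derivative has the simple form $\dot{q}_i = q_i(a_i - \bar{a})$; it is not taken from the article.

```python
import math

A = (0.0, 1.0, 3.0)   # arbitrary scores a_i defining the illustrative family

def family(theta):
    """Hypothetical family: q_i proportional to exp(theta * a_i)."""
    weights = [math.exp(theta * ai) for ai in A]
    total = sum(weights)
    return [wt / total for wt in weights]

def jeffreys(q, q_new):
    """J = sum_i (q_i' - q_i) log(q_i' / q_i)."""
    return sum((qn - qi) * math.log(qn / qi) for qi, qn in zip(q, q_new))

def fisher(theta):
    """F = sum_i qdot_i^2 / q_i, using qdot_i = q_i (a_i - a_bar) for this family."""
    q = family(theta)
    a_bar = sum(qi * ai for qi, ai in zip(q, A))
    q_dot = [qi * (ai - a_bar) for qi, ai in zip(q, A)]
    return sum(qd ** 2 / qi for qd, qi in zip(q_dot, q))

theta = 0.2
F = fisher(theta)
ratios = {}
for d_theta in (1.0, 0.1, 0.001):
    J = jeffreys(family(theta), family(theta + d_theta))
    ratios[d_theta] = J / d_theta ** 2    # should approach F as d_theta shrinks
    print(d_theta, ratios[d_theta], F)
```

As the step in $\theta$ shrinks, the ratio $J/\Delta\theta^2$ approaches $F(\theta)$, whereas for the largest step the two differ substantially.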



Pearson's chi-square divergence 

We have from the previous expression

$$J(\theta) \to \sum_i \left(\frac{\dot{q}_i^2}{q_i}\right)d\theta^2 = \sum_i \frac{dq_i^2}{q_i}. \tag{31}$$

Pearson's chi-square divergence, or chi-square test statistic,
is usually described as follows. Given an expected
probability distribution, $\{q_i\}$, and an observed probability
distribution, $\{q_i'\}$, the chi-square statistic is the sum of
observed minus expected squared over expected. Writing
the observed minus expected squared as $\Delta q_i^2 = (q_i' - q_i)^2$,
we have

$$\chi^2(\theta) = \sum_i \frac{\Delta q_i^2}{q_i}.$$

As the changes become small,

$$\chi^2(\theta) \to \sum_i \frac{dq_i^2}{q_i} = \sum_i \left(\frac{\dot{q}_i^2}{q_i}\right)d\theta^2,$$

demonstrating that the Jeffreys and chi-square divergences
have the same limiting form. The next section
shows that the limiting form is related to the Fisher
information metric.

When changes are large, only the Jeffreys divergence
gives the correct expression for changes by selection in
mean log fitness, $\Delta_s\bar{m}$. The chi-square divergence is the
change in mean fitness on a linear scale

$$\Delta_s\bar{w} = \sum_i \Delta q_i\, w_i = \sum_i \frac{\Delta q_i^2}{q_i}.$$

As I discussed in the text, the correct scale for analyzing
changes in fitness is logarithmic, because fitness is a
relative measure, and logarithmic scaling is the correct
scale for relative measures [56]. In addition, the relations
between selection and information are only clear on the
logarithmic scale, because it is only on that scale that
one can see the connections to the classic theories of
entropy and information. In the limit of small changes, the
logarithmic scale becomes linear, and thus $\Delta_s\bar{m} \to \Delta_s\bar{w}$.
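
The contrast between the two divergences for small and large changes can be sketched numerically. The frequency vectors below are hypothetical, and fitness is taken as $w_i = q_i'/q_i$, so that mean fitness is one, as in the expression for $\Delta_s\bar{w}$ above.

```python
import math

def jeffreys(q, q_new):
    """Jeffreys divergence: change in mean log fitness caused by selection."""
    return sum((qn - qi) * math.log(qn / qi) for qi, qn in zip(q, q_new))

def chi_square(q, q_new):
    """Chi-square divergence: sum_i (q_i' - q_i)^2 / q_i."""
    return sum((qn - qi) ** 2 / qi for qi, qn in zip(q, q_new))

def change_in_mean_fitness(q, q_new):
    """sum_i Delta q_i w_i with w_i = q_i' / q_i (mean fitness equal to one)."""
    return sum((qn - qi) * (qn / qi) for qi, qn in zip(q, q_new))

# Hypothetical frequency vectors for illustration only
q = [0.2, 0.5, 0.3]
q_small = [0.201, 0.499, 0.300]   # small change in frequencies
q_large = [0.6, 0.3, 0.1]         # large change in frequencies

for label, q_new in (("small", q_small), ("large", q_large)):
    print(label, jeffreys(q, q_new), chi_square(q, q_new),
          change_in_mean_fitness(q, q_new))
```

In both cases the chi-square divergence equals the change in mean fitness on the linear scale. It stays close to the Jeffreys divergence for the small change but departs from it markedly for the large change.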



Alternative expressions for Fisher information 

One can think of Fisher information as the change in a 
probability distribution with respect to a change in a parameter
that specifies the distribution. The more rapidly
a distribution changes with respect to a parameter, the
more information each observation provides about the 
value of the parameter. For example, if the distribution 
changes very slowly, then small differences in the distri- 
bution of observed values may translate into big differ- 
ences in parameter values. Thus, approximately similar 
distributions of observations map to widely different pa- 
rameter values, so each observation provides relatively 
little information about the parameter. If, by contrast, 
the distribution changes rapidly with respect to a param- 
eter, then the distribution of observations is very different 
for small changes in the parameter, and each observation 
provides a lot of information about the likely value of the 
parameter. 

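Mathematically, Fisher information is the negative
value of the expected curvature of the log-likelihood
function

$$F(\theta) = -\sum_i q_i\,\frac{\partial^2\log(q_i|\theta)}{\partial\theta^2}.$$

Doing the differentiation, and noting [65] that

$$\sum_i \frac{\partial^2 q_i}{\partial\theta^2} = \frac{\partial}{\partial\theta}\sum_i \frac{\partial q_i|\theta}{\partial\theta} = 0,$$

because the sum of changes in frequencies must be zero
over a distribution, we obtain

$$F(\theta) = \sum_i \frac{\dot{q}_i^2}{q_i}.$$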
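
The agreement of the two expressions for Fisher information can be checked on a concrete family. As an illustrative assumption, not an example from the article, the sketch uses Hardy-Weinberg genotype frequencies with allele frequency $\theta$, for which $\sum_i \dot{q}_i^2/q_i$ simplifies to $2/[\theta(1-\theta)]$. The curvature form is evaluated by central finite differences with a hypothetical step size `h`.

```python
import math

def genotype_freqs(theta):
    """Hardy-Weinberg genotype frequencies for allele frequency theta."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]

def genotype_derivs(theta):
    """Analytic derivatives qdot_i = d q_i / d theta."""
    return [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]

def fisher_curvature(theta, h=1e-4):
    """F = -sum_i q_i d^2 log(q_i) / d theta^2, by central finite differences."""
    q = genotype_freqs(theta)
    up = genotype_freqs(theta + h)
    down = genotype_freqs(theta - h)
    total = 0.0
    for qi, qu, qd in zip(q, up, down):
        second = (math.log(qu) - 2 * math.log(qi) + math.log(qd)) / h ** 2
        total -= qi * second
    return total

def fisher_squared_rate(theta):
    """F = sum_i qdot_i^2 / q_i."""
    q = genotype_freqs(theta)
    q_dot = genotype_derivs(theta)
    return sum(qd ** 2 / qi for qd, qi in zip(q_dot, q))

theta = 0.3
F_curv = fisher_curvature(theta)
F_rate = fisher_squared_rate(theta)
F_closed = 2 / (theta * (1 - theta))
print(F_curv, F_rate, F_closed)
```

The curvature form and the squared-rate form agree up to the finite-difference error, and both match the closed form for this family.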



A large number of different divergence measures converge
to Fisher information in the limit. Thus, knowing only
that the limiting form of a divergence is Fisher information
only weakly constrains the associated form of divergence.
For example, from the expression above for the
chi-square divergence

$$\chi^2(\theta) \to \sum_i \left(\frac{\dot{q}_i^2}{q_i}\right)d\theta^2,$$



it might be tempting, in a particular application in which 
Fisher information arises, to think of the chi-square di- 
vergence as somehow the natural measure of change, be- 
cause the chi-square form for large changes most closely 
resembles the limiting Fisher information form for small 
changes. In the case of selection, that conclusion would 
not be correct. The Jeffreys divergence is in fact the nat- 
ural measure of change, because the logarithmic scale is 
the natural scale for changes in fitness and for changes in 
information. 



APPENDIX B: HISTORICAL ASPECTS 

Kimura [75] noted that the change in fitness in certain
models of selection is

$$\Delta_s\bar{m} = \sum_i \frac{\dot{q}_i^2}{q_i}. \tag{32}$$

Kimura used the standard notion of change with respect
to time in his study of continuous dynamics with respect
to small changes. Thus, the parameter is $\theta = t$ for time,
and $\dot{q} = dq/dt$.

Ewens [77] and Edwards [44] provide comprehensive
syntheses of the literature on the various uses of Kimura's
expression, $\sum_i \dot{q}_i^2/q_i$. The main use concerned information
geometry expressions of selection dynamics on a Riemannian
manifold. Neither Ewens nor Edwards found
that discussion of information geometry particularly useful.
Edwards did note that Kimura's expression is
in fact just an expression for Fisher information. But
Edwards did not think that association was useful.

I agree with the criticisms by Ewens and Edwards
within the context of how the literature had been framed.
From Kimura [76] through the various developments in
the literature, the emphasis had always been on dynamics
with respect to time. I agree with Edwards that one
cannot say anything very interesting about the temporal
dynamics of evolutionary change from the simple expression
in Eq. (32) for selection. That expression is the
partial change caused by selection [55H55], not the total
evolutionary change. The partial change gives a clear
sense of what selection is doing at any moment, but provides
no insight by itself about evolutionary dynamics.

My presentation in this article is also based on Fisher 
information and, more generally, on the Jeffreys diver- 
gence. Two aspects of my presentation go beyond the 
past work and, in my view, provide a compelling case for 
framing our understanding of selection in these terms. 

First, I connected selection to information theory
through the general result $\Delta_s\bar{m} = J$, the Jeffreys
divergence. This result does not depend on the limit of
small changes, but instead is a general description of the
nature of selection. This result establishes the proper
measure for the amount of information accumulated by
selection.

Second, I related the change in information to various 
underlying parametric and nonparametric scales. Those
scales provide the meaning with respect to the abstract 
scale for encoded information that forms the basis for 
classical information theory. As Edwards [44] empha- 
sized, Fisher information is information about meaning 
with respect to underlying parameters [2]. Earlier work 
implicitly used time as the parameter, which is not a 
meaningful way of expressing the accumulation of infor- 
mation. One does not think of selection as providing 
information about time. In addition to making the para- 
metric basis for selection and information explicit, my 
use of the Jeffreys divergence clarified the relation of se- 
lection to classical information theory. 

Finally, I achieved greater generality than past work by 
respecting the fundamental distinction between selection 
and evolution. Past work often tried to make general 
statements about evolutionary dynamics, which is not 
possible. It is possible to make strong and completely 
general statements about the partial change caused by 
selection. Such statements clarify the relations between 
selection and information. One can achieve that depth 
and generality only by working within the fundamental 
limitations imposed by the distinction between selection 
and total evolutionary change. 

I mentioned that Ewens [77] and Edwards [44] concluded
that past work based on Kimura's result did
not contribute significantly to understanding selection.
Ewens [77] did develop his own extension to that theory,
in which he showed an optimization principle in relation
to Fisher's fundamental theorem. Frank [5] developed
a similar idea but with a different approach that emphasized
information and the Fisher information metric.
Those studies derive from a partitioning of the causes of
fitness, which is the topic of a future article in this series.



